Wikidata:Property proposal/OpenML dataset ID
OpenML dataset ID
Originally proposed at Wikidata:Property proposal/Authority control
Description | identifier for a dataset in the OpenML database of open datasets for machine learning |
---|---|
Represents | OpenML (Q52988856) |
Data type | External identifier |
Domain | items: data set (Q1172284) |
Allowed values | [1-9][0-9]* |
Example 1 | Iris flower data set (Q4203254) → 61 |
Example 2 | MNIST database (Q17069496) → 554 |
Example 3 | CIFAR-10 (Q45037095) → 40927 |
Number of IDs in source | currently ca. 4500 |
Expected completeness | always incomplete (Q21873886) |
Implied notability | Wikidata property for an identifier that does not imply notability (Q62589320) |
Formatter URL | https://www.openml.org/d/$1 |
Applicable "stated in"-value | OpenML (Q52988856) |
Wikidata project | WikiProject Datasets (Q60003940) |
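
The "Allowed values" pattern and the "Formatter URL" above can be combined into a small check. The following is a minimal sketch, not part of the proposal: the function names `is_valid_openml_id` and `openml_url` are hypothetical, and it assumes that the pattern must match the whole value and that `$1` is replaced by the identifier verbatim.

```python
import re

# Pattern and formatter URL copied from the "Allowed values" and
# "Formatter URL" rows of the proposal.
ALLOWED_VALUES = re.compile(r"[1-9][0-9]*")
FORMATTER_URL = "https://www.openml.org/d/$1"


def is_valid_openml_id(value: str) -> bool:
    """Return True if the whole string matches the allowed-values pattern."""
    # Assumption: the pattern is anchored to the full value.
    return ALLOWED_VALUES.fullmatch(value) is not None


def openml_url(value: str) -> str:
    """Substitute a validated ID for the $1 placeholder in the formatter URL."""
    if not is_valid_openml_id(value):
        raise ValueError(f"not a valid OpenML dataset ID: {value!r}")
    return FORMATTER_URL.replace("$1", value)


# The three examples given in the proposal (item QID -> OpenML dataset ID).
EXAMPLES = {"Q4203254": "61", "Q17069496": "554", "Q45037095": "40927"}
for qid, dataset_id in EXAMPLES.items():
    print(qid, openml_url(dataset_id))
```

Under these assumptions, values with a leading zero (such as `061`) or any non-digit character are rejected, while the three example IDs resolve to URLs under `https://www.openml.org/d/`.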
Motivation
OpenML (Q52988856) is a platform that hosts machine learning datasets and facilitates their use for analysis, benchmarking and education. Mapping dataset items here to dataset entries there seems useful, and a dedicated property is perhaps the most straightforward way to achieve that. Daniel Mietchen (talk) 11:39, 23 September 2022 (UTC)
- WikiProject Informatics has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead. --Daniel Mietchen (talk) 11:56, 23 September 2022 (UTC)
- Notified participants of WikiProject Datasets --Daniel Mietchen (talk) 11:57, 23 September 2022 (UTC)
Discussion
- Support --Dhx1 (talk) 12:13, 23 September 2022 (UTC)
- Support. YULdigitalpreservation (talk) 12:28, 23 September 2022 (UTC)
- Support. Wikidata routinely indexes external identifiers. This is a small collection, which makes it easier to approve. The subject matter is also well aligned with Wikidata's interests. Bluerasberry (talk) 13:17, 23 September 2022 (UTC)
- I generally agree, but many datasets will be available with various modifications.
- E.g.:
- CIFAR is also available as CIFAR_10_small and STL-10 ("CIFAR-10 dataset but with some modifications")
- The MNIST example is "mnist_784: The MNIST database of handwritten digits with 784 features". Is this the same as the original MNIST or different?
- I think WD has datasets in the sense of "Works"; in contrast, OpenML has concrete "Expressions".
- So do we point to all of them, or the "best" or "closest to original"? --Vladimir Alexiev (talk) 14:14, 23 September 2022 (UTC)
- @Vladimir Alexiev: Perhaps we start with "verified" / "original" datasets, as suggested by John Samuel below. In the long run, we need something like a "version" property for datasets; neither edition or translation of (P629) nor software version identifier (P348) currently supports that. --Daniel Mietchen (talk) 09:20, 26 September 2022 (UTC)
- Support external identifier. IMO, Wikidata editors (considering the above comment) may point to verified (or original) datasets. John Samuel (talk) 16:16, 23 September 2022 (UTC)
- Support --Tinker Bell ★ ♥ 02:11, 24 September 2022 (UTC)
- @Daniel Mietchen, Dhx1, YULdigitalpreservation, Bluerasberry: @Vladimir Alexiev, Jsamwrites, Tinker Bell: Done ArthurPSmith (talk) 19:42, 30 November 2022 (UTC)