Last modified: 2014-05-15 11:08:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T64555, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 62555 - Add support for datasets


Summary:	Add support for datasets

Status:	RESOLVED WONTFIX

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	WikidataRepo
Version:	unspecified
Hardware:	All All

Importance:	Lowest enhancement with 3 votes
Target Milestone:	---
Assigned To:	Wikidata bugs

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-03-12 00:13 UTC by dacuetu
Modified:	2014-05-15 11:08 UTC
CC List:	7 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description dacuetu 2014-03-12 00:13:45 UTC

"A dataset corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the dataset in question."

For inspiration see:
http://json-stat.org/schema/
http://dataprotocols.org/json-table-schema/
http://www.w3.org/wiki/WebSchemas/Datasets

Related discussions:
https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team/Archive/2014/03#What.27s_the_plan_for_heavy_data.3F
https://meta.wikimedia.org/wiki/Talk:DataNamespace

Comment 1 filceolaire 2014-03-14 22:38:13 UTC

So in practice we would use this where a property has hundreds of different values distinguished by qualifiers?

E.g. a table of population property values for a particular administrative region, with columns for date, sex, age from, age to, race, religion, source, basis, preferred/deprecated, etc.?

Is this just a presentation thing or a change to the data format?

Comment 2 dacuetu 2014-03-15 16:57:36 UTC

There are important differences:
1.- Fixed structure: theoretically you could convert any dataset into a classical statement-qualifier structure, but not always the other way round. And there is no need to.
2.- No need to source: a dataset is part of a statement, so the source would go on the statement it is linked from, not on the page where the information is stored (which makes it much easier to manage).
3.- Non-editable content: as with media files, datasets are not expected to change. You could clear them and repopulate them, though (like uploading a new version of a file on Commons). Defining which property corresponds to each row, defining which Q-items are used in the dataset, and checking datatype constraints should be done only on dataset creation/upload.

In any case, it provides advantages over just using tabular data, such as re-using the well-defined existing Wikidata datatypes and those still to come, re-using visualization options that might be developed for queries, etc.

Comment 3 John F. Lewis 2014-04-08 20:56:07 UTC

Set to Lowest for now.

How would this be used in practice within Wikidata? From my understanding, this would contain data of several different datatypes as well as some other kinds of information.

Basically, how would all the features of Wikidata (source, qualifiers, ranks) and functions such as searching, sorting, querying, Lua and parsing be integrated with this?

Comment 4 Lydia Pintscher 2014-05-05 13:38:39 UTC

Wontfixing this for lack of answers to John's fundamental questions. I'm sorry, but we really shouldn't do this if we don't want to break some basics of Wikidata.

Comment 5 Yair Rand 2014-05-05 13:47:07 UTC

(In reply to John F. Lewis from comment #3)
> How would this be used in practice within Wikidata? From my understanding,
> this would contain data of several different datatypes as well as some
> other kinds of information.
> 
> Basically, how would all the features of Wikidata (source, qualifiers, ranks)
> and functions such as searching, sorting, querying, Lua and parsing be
> integrated with this?
If a dataset/chart/table were considered its own datatype, then sources, qualifiers, and ranks could work as with other statements. An item could contain a statement whose value would be a chart, and sources, qualifiers, and ranks would be set through the normal interface (the "add source" button and so on).

Comment 6 dacuetu 2014-05-05 14:52:16 UTC

As Yair said, the data would be considered a (JSON?) blob and sourced/ranked/qualified as a whole. The blob could reside on its own namespace page (Dataset:X), which wouldn't be editable like other items. Instead, new revisions could be uploaded and the data typing would be tested on upload.

The upload test would consist of:
- checking that the data conforms to the Wikidata data representations
- checking that every claim has the same claim-qualifier structure as the first claim
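
As a rough illustration of these two checks, here is a minimal Python sketch. The dictionary layout (`property`, `value`, `qualifiers`), the `EXPECTED_TYPES` table and the function name are assumptions made for this example only; they are not Wikibase's actual data model or API.

```python
from typing import Any

# Hypothetical in-memory form of an uploaded dataset: a list of claims,
# each with a main value and a dict of qualifiers (illustrative only).
Claim = dict[str, Any]

# Assumed mapping from property name to the Python type its value must have.
EXPECTED_TYPES = {"Group": str, "population": int, "percentage": float}


def validate_dataset(claims: list[Claim]) -> list[str]:
    """Return a list of problems; an empty list means the upload passes."""
    if not claims:
        return ["dataset is empty"]

    problems: list[str] = []
    # The first claim fixes the structure every later claim must follow.
    first = claims[0]
    expected_shape = (first["property"], tuple(sorted(first["qualifiers"])))

    for index, claim in enumerate(claims):
        # Structure check: same property and same set of qualifiers.
        shape = (claim["property"], tuple(sorted(claim["qualifiers"])))
        if shape != expected_shape:
            problems.append(f"claim {index}: structure differs from first claim")
            continue
        # Datatype check: every value matches the type assumed for its property.
        values = {claim["property"]: claim["value"], **claim["qualifiers"]}
        for name, value in values.items():
            expected = EXPECTED_TYPES.get(name)
            if expected is None or not isinstance(value, expected):
                problems.append(f"claim {index}: bad value for {name!r}")
    return problems
```

An upload would be accepted only when `validate_dataset(...)` returns an empty list; a real implementation would check against the Wikidata datatypes rather than plain Python types.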

Lua modules could access it like other Wikidata items (since the datatypes, structure, etc. are the same as for any item), with the difference that the data is considered static and its structure is always the same.

For instance this table [1] would be called "Dataset:Populations with multiracial identifiers in CA 2010". It would be used as a value of a claim as:
"Demographics of California (Q3044234)" <census data> "Dataset:Populations with multiracial identifiers in CA 2010"
It would be sourced and qualified as usual ("year of creation:2010").
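
To make the shape of such a statement concrete, a minimal Python sketch follows. The `DatasetStatement` class and its field names are hypothetical and exist only for this illustration; the item, property and dataset title are the ones from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetStatement:
    """Hypothetical statement whose value is a reference to a dataset page."""
    item: str            # the item the statement is attached to
    property: str        # the property linking the item to the dataset
    dataset_page: str    # page title in the proposed Dataset: namespace
    rank: str = "normal"
    qualifiers: dict[str, str] = field(default_factory=dict)
    sources: list[str] = field(default_factory=list)  # references for the whole dataset


statement = DatasetStatement(
    item="Q3044234",  # Demographics of California
    property="census data",
    dataset_page="Dataset:Populations with multiracial identifiers in CA 2010",
    qualifiers={"year of creation": "2010"},
)
```

The point of the sketch is that rank, qualifiers and sources hang off the statement as a whole, while the rows of the dataset live on the referenced page.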

The dataset itself would be represented as claims (*) with qualifiers (--):
*Group: White
--population:15,763,625
--percentage:42.3%
*Group: Hispanic
--population:14,013,719
--percentage:37.6%
etc.

This would translate visually into a non-editable spreadsheet:
{|
|-
! Group !! population !! percentage
|-
| White || 15,763,625 || 42.3%
|-
| Hispanic || 14,013,719 || 37.6%
|}
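
The translation from claims with qualifiers to such a table could look roughly like the following Python sketch. The dictionary layout and the `to_wikitext` function are assumptions for illustration, not an existing Wikibase or MediaWiki function; values are kept as display strings for simplicity.

```python
def to_wikitext(claims):
    """Render claims-with-qualifiers as a wikitext table, one row per claim."""
    first = claims[0]
    # Column order: the claim's property first, then its qualifiers in order.
    headers = [first["property"], *first["qualifiers"]]
    lines = ["{|", "|-", "! " + " !! ".join(headers)]
    for claim in claims:
        cells = [claim["value"], *claim["qualifiers"].values()]
        lines += ["|-", "| " + " || ".join(cells)]
    lines.append("|}")
    return "\n".join(lines)


claims = [
    {"property": "Group", "value": "White",
     "qualifiers": {"population": "15,763,625", "percentage": "42.3%"}},
    {"property": "Group", "value": "Hispanic",
     "qualifiers": {"population": "14,013,719", "percentage": "37.6%"}},
]
print(to_wikitext(claims))
```

This relies on every claim having the same qualifiers in the same order, which is what the structure check at upload time is meant to guarantee.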

I hope this answers John's fundamental questions. Lydia, do you still think that it would break some Wikidata basics? If so, could it be considered for Wikibase-Commons?

[1] https://en.wikipedia.org/wiki/Demographics_of_California#2010_Populations_with_multiracial_identifiers

Comment 7 Lydia Pintscher 2014-05-05 15:05:45 UTC

Yes, I do think so, because none of this addresses how it would be treated in queries, for example. If you have a spreadsheet of population figures, how is it going to show up in searches for "population > 5 million" and so on? And where do you draw the line between having such data in a "tabular datatype" that consists of other datatypes and having them as statements on their own? I'm sorry folks, but there are so many conceptual issues with this...

Comment 8 dacuetu 2014-05-05 15:23:18 UTC

The thing with datasets is that they are data not usually included in Wikidata proper, because the effort required to enter and maintain them as regular data would be too great. By offering a simplified alternative, at least the data could be shared without copying and pasting. It also avoids creating Wikidata pages with so many statements that they cannot be loaded. And perhaps it could be used to generate visual representations.

Besides, it need not show up in searches other than in the way Commons files show up in searches.

Comment 9 dacuetu 2014-05-06 09:14:17 UTC

As an implementation example, see:
http://datahub.io/

Example 1: "2011 Annual Report for the Vancouver Landfill"
http://datahub.io/dataset/2011-city-of-vancouver-landfill-quantities-of-nuisance-waste-and-recyclable-materials/resource/6fdf7864-415b-4a11-88f6-351645cf802f

Example 2: "Spanish Premier Football league 2013/2014"
http://datahub.io/dataset/spain-football-match-data-la-liga-primera-segunda/resource/d2a579f9-d3aa-49e8-8bc5-e63db55106d1

Example 3: http://datahub.io/dataset/municipal-organics-diversion-carbon-credits-for-carbon-neutral-reporting-2012-reporting-year/resource/b4ee5c88-08a5-4f2c-a5ab-f301a3a6d956

All of that is data that probably won't make it into Wikidata, but it might be useful for Wikimedia projects if it lived in a central, structured repository.

Perhaps Wikidata wouldn't be the right place to store it, but Wikibase could provide the technology to another site.

Comment 10 John Mark Vandenberg 2014-05-06 09:55:45 UTC

I agree this is needed (somewhere). If there are no native data islands, they will appear as data wrapped in code, instead of code having nice libraries (above mw.loadData) to access structured datasets.

https://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual#mw.loadData

Comment 11 dacuetu 2014-05-15 11:08:04 UTC

RFC on datasets:
https://meta.wikimedia.org/wiki/Requests_for_comment/How_to_deal_with_open_datasets
