Last modified: 2014-07-29 08:10:26 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T56369, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 54369 - Set up generation of JSON dumps for wikidata.org
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Importance: Normal enhancement with 1 vote
Target Milestone: ---
Assigned To: Ariel T. Glenn
Depends on: 57214
Blocks: 68792 68793
Reported: 2013-09-20 09:50 UTC by Daniel Kinzler
Modified: 2014-07-29 08:10 UTC
CC: 7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Daniel Kinzler 2013-09-20 09:50:36 UTC
We would like to make the contents of wikidata.org available as a dump using our canonical JSON format. The maintenance script for doing this is 

  extensions/Wikibase/repo/maintenance/dumpJson.php

This will send a JSON serialization of all data entities to standard output, so I suppose that would best be piped through bz2.
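
For illustration, one way to do that from the command line might be the following (the mwscript wrapper and wiki name are taken from the commands quoted later in this report; the output file name is just an example):

  /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php \
      --wiki wikidatawiki | bzip2 -9 > wikidata.json.bz2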

This should work as-is, but there are several things that we should look out for or try out:

* I don't know how long it will take to make a complete dump. I expect that it'll be roughly the same as making an XML dump of the current revisions.
* I don't know how much RAM is required. Currently, all the IDs of the entities to output will be loaded into memory (by virtue of how the MySQL client library works) - that's a few dozen million rows. As a guess, 1GB should be enough. 
* We may have to make the script more resilient to sporadic failures, especially since a failure would currently mean restarting the dump. 
* Perhaps sharding would be useful: the script supports --sharding-factor and --shard to control how many shards there should be, and which shard the script should process. Combining the output files is not as seamless as it could be, though (it involves chopping off lines at the beginning and the end of files); see the sketch after this list.
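
A rough sketch of how the sharding options above might be combined, assuming three shards run in parallel on one machine; the output file names are placeholders:

  # Run three shards in parallel; each writes its own partial JSON file.
  for i in 0 1 2; do
      php extensions/Wikibase/repo/maintenance/dumpJson.php \
          --sharding-factor 3 --shard $i > "wikidata.shard$i.json" &
  done
  wait
  # The partial files still need post-processing (trimming lines at the start
  # and end of each) before they can be combined into one valid dump.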
Comment 1 Daniel Kinzler 2013-11-06 19:26:17 UTC
Addendum: use the --output command line option to specify an output file instead of using stdout. This enables progress information and error reports to be written to stdout.
Comment 2 Aude 2013-11-12 11:04:17 UTC
steps are:

* try it manually (e.g. screen session) on terbium with test.wikidata
* try it manually with wikidatawiki
* figure out where the output will go (somewhere on dumps.wikimedia.org)
* set up a cron job to have it run periodically and automatically (a sketch follows below).
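
A minimal crontab sketch for the last step; the schedule and output path are placeholders, not decisions:

  # Run weekly (Monday 03:00 UTC); the output path is a placeholder.
  0 3 * * 1  /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --output /data/dumps/wikidata.json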
Comment 3 Ariel T. Glenn 2013-11-12 11:05:28 UTC
If someone could give me the command (with args) to run this on test.wikidata, I'll do that on terbium.
Comment 4 Aude 2013-11-12 11:15:53 UTC
/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --output wikidata.json

Optionally, the sharding parameters can be used to make the script run faster:

e.g. 

--shard 2 --sharding-factor 3
Comment 5 Daniel Kinzler 2013-11-12 11:26:30 UTC
Note that the script doesn't fork itself for sharding. With --sharding-factor 3, you'll need 3 cron jobs (possibly on different boxes) with --shard 0, --shard 1, and --shard 2, respectively.

But for now, we should try without sharding; collecting the output for all shards into a single file would need some post-processing anyway.
Comment 6 Aude 2013-11-12 11:48:02 UTC
The script needs to be able to pipe to bzip2 while sending errors elsewhere.
Comment 7 Daniel Kinzler 2013-11-12 15:23:52 UTC
Three options for compression & error reporting:

1) don't specify --output - then it'll write to stdout, and you can bzip it. Progress and error reporting is silenced, though.

2) use PHP's bzip2 stream wrapper: --output compress.bzip2://wikidata.json.bz2

3) make dumpJson.php always write errors to stderr; then it's no longer important whether you use --output or not.
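
For example, option 2 could look like this (the output file name is illustrative):

  # PHP's bzip2 stream wrapper compresses while writing, and progress messages
  # still go to stdout; it gives no control over the compression level, though.
  /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php \
      --wiki wikidatawiki --output compress.bzip2://wikidata.json.bz2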
Comment 8 Aude 2013-11-12 15:26:16 UTC
option #2 or #3 could work.  I would try #2.

Silencing progress and error reporting is not an acceptable option, in my opinion.
Comment 9 Daniel Kinzler 2013-11-13 16:15:00 UTC
While I think option #2 (using the stream wrapper for compression) would work, it gives no control over compression parameters.

I have filed bug 57015 for option #3; I think it would be nice to have that. But please go ahead and try and set up the dump script already, using the stream wrapper. There's no reason to wait for the logging options.
Comment 10 Ariel T. Glenn 2013-11-13 17:29:32 UTC
--output compress.bzip2://wikidata.json.bz2  is not doing it for me, I'm getting 

Warning: fopen(compress.bzip2://wikidata.json.bz2): failed to open stream: operation failed in /usr/local/apache/common-local/php-1.23wmf3/extensions/Wikibase/repo/maintenance/dumpJson.php on line 102
[fd9eae8e] [no req]   Exception from line 105 of /usr/local/apache/common-local/php-1.23wmf3/extensions/Wikibase/repo/maintenance/dumpJson.php: Failed to open compress.bzip2://wikidata.json.bz2!
Comment 11 Daniel Kinzler 2013-11-18 14:42:01 UTC
@ariel: Maybe you just don't have write permission there? Or PHP doesn't have bzip2 support built in? Anyway...

I have made a patch introducing a --log option to control where log messages go, see I561a003.

I have also found and fixed a bug that caused invalid JSON in case an entity couldn't be loaded/dumped, see Ief7664d6.

I guess we have to wait for these to get deployed. Or just backport them, these patches are nicely local.
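
Assuming the new --log option simply takes a file path for the log messages (an assumption about I561a003, not a confirmed signature), the combined invocation might then look like:

  # --output writes the compressed dump; --log (assumed to take a file path)
  # keeps progress and error messages in a separate file.
  /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php \
      --wiki wikidatawiki --output compress.bzip2://wikidata.json.bz2 --log dumpJson.log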
Comment 12 Ariel T. Glenn 2013-11-18 14:48:29 UTC
I was writing into my home directory on terbium, so surely I had permissions. Anyways, once the --log option is in, this is a moot point.
Comment 13 Ariel T. Glenn 2013-12-05 08:22:35 UTC
Just an update: after 22 hours of running against wikidata, we are at 202400 entities dumped. So sharding is going to be necessary; please start looking into what it would take to have one cron job for this and whatever post-processing might be needed as well. Alternatively, are there speedups possible in the script?
Comment 14 Ariel T. Glenn 2013-12-09 10:03:54 UTC
The dump concluded at Sat Dec  7 13:57:32 UTC 2013 with 221400 entities. File size (bz2 compressed): 98 MB.
Comment 15 Daniel Kinzler 2013-12-19 16:07:35 UTC
The dump *concluded* with 221400 entities dumped? That's... wrong. We have more than 10 million entities on wikidata.org.  Any idea how to best investigate this?

Also, what you report seems *extremely* slow. It took about 10 seconds for each entity (20k additional entities between the 5th and the 7th)? Wow...

There are no obvious points for speedup, but I could do some profiling. One thing that could be done of course is to not load the data from the database, but instead process an XML dump. Would that be preferable to you?
Comment 16 Rob Lanphier 2013-12-19 17:15:36 UTC
Hi Daniel, how quickly does it run in your tests?  Do you have the full dataset available via the XML dump?
Comment 17 Daniel Kinzler 2013-12-19 17:32:13 UTC
Creating a dump for the 349 items I have in the DB on my laptop takes about 2 seconds. These are not very large, but then, most items on wikidata are not large either (while a few are very large).

 > time php repo/maintenance/dumpJson.php > /dev/null
  Processed 100 entities.
  Processed 200 entities.
  Processed 300 entities.
  Processed 349 entities.
 
  real	0m2.385s
  user	0m1.996s
  sys	0m0.088s

All data is available in the XML dumps, but we'd need two passes (for the first pass, a dump of the property namespace would be sufficient). I don't currently have a dump locally. 

The script would need quite a bit of refactoring to work based on XML dumps; I'd like to avoid that if we are not sure this is necessary / desirable. 

I don't think we currently have a good way to test with a large data set ourselves. Importing XML dumps does not really work with Wikibase (this is an annoying issue, but not easy to fix).
Comment 18 Daniel Kinzler 2013-12-19 17:34:30 UTC
Addendum to my comment above: I suspect one large factor is loading the JSON from the external store. Is there a way to optimize that? We are only using the latest revision, so grouping revisions wouldn't help...

Still, going from 100+ items per second to 10 seconds per item is surprising, to say the least.
Comment 19 Ariel T. Glenn 2013-12-20 07:32:48 UTC
Processed 221200 entities.
Processed 221300 entities.
Processed 221400 entities.
Sat Dec  7 13:57:32 UTC 2013

that's the end of the output from the job (I still have the screen session on terbium).

I'm writing bz2 compressed output. You're writing to /dev/null.  That's going to be a big difference.

Here's the start of the last item in the dumped output:

{"id":"Q235312","type":"item","descriptions":...
Comment 20 Ariel T. Glenn 2013-12-20 09:02:03 UTC
221447 lines in the uncompressed file, total size of 1.1 GB uncompressed.
Comment 21 Marius Hoch 2014-02-25 23:13:13 UTC
After I tested this myself on terbium, I found out that the PHP script is constantly leaking memory... I think this is because Wikibase is "smart" and statically caches all entities ever requested.

Another thing we noticed is that it's apparently not getting all entity IDs from the query; it would probably be wise to batch the query that fetches the entity IDs.
