Last modified: 2014-09-04 07:32:56 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links might be broken. See T70793, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 68793 - Wikidata JSON dump: better compression than gzip
Status: NEW
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Ariel T. Glenn
Depends on: 54369
Blocks:
Reported: 2014-07-29 08:10 UTC by Nemo
Modified: 2014-09-04 07:32 UTC (History)


Description Nemo 2014-07-29 08:10:26 UTC
I converted 20140721.json.gz to 20140721.json.xz and 20140721.json.bz2; the gz file is 2.9 GB, while the other two are 2.0 GB each. The space saved seems worth the effort.

For decompression, which is what matters, xz took 4 min vs. 2 min for gz. All of these formats are supported natively by tar -af etc.; in recent versions, xz is parallel. I'm quoting from memory, because I killed the screen session by mistake, but it seems LZMA/xz may be the best choice.
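
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch using only the Python standard library. Going by the sizes above, 2.0 GB instead of 2.9 GB is roughly 31% smaller. The sketch is not the procedure used for the numbers in this report (those came from the command-line tools on the real dump); the `compare_codecs` helper and the synthetic `sample` data are assumptions made up for illustration, so the ratios and timings it prints will differ from the real file.

```python
import bz2
import gzip
import lzma
import time


def compare_codecs(data: bytes) -> dict:
    """Return compressed size and decompression time per codec for `data`."""
    # Hypothetical helper: each entry maps a label to (compress, decompress).
    codecs = {
        "gz": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),  # lzma defaults to the .xz format
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        packed = compress(data)
        start = time.perf_counter()
        restored = decompress(packed)
        elapsed = time.perf_counter() - start
        assert restored == data  # round-trip sanity check
        results[name] = {"bytes": len(packed), "seconds": elapsed}
    return results


# Synthetic stand-in for dump content (an assumption, not real Wikidata data).
sample = b"".join(
    b'{"id":"Q%d","type":"item","labels":{"en":{"language":"en","value":"item %d"}}}\n'
    % (i, i)
    for i in range(20000)
)

if __name__ == "__main__":
    for name, stats in compare_codecs(sample).items():
        print(f"{name}: {stats['bytes']} bytes, "
              f"{stats['seconds']:.4f} s to decompress")
```

To test against an actual dump, the same idea applies with `gzip.open`, `bz2.open` and `lzma.open` reading the files in chunks, since the full file will not fit comfortably in memory.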
Comment 1 Nemo 2014-07-29 11:57:42 UTC
(In reply to Nemo from comment #0)
> in recent
> versions, xz is parallel

Source: http://sourceforge.net/p/lzmautils/discussion/708858/thread/d37155d1/#d8af (currently Ubuntu has liblzma 5.1.0alpha; Fedora 20 has 5.1.2alpha).
Comment 2 Markus Krötzsch 2014-09-04 07:32:56 UTC
This could also be implemented by offering several formats, as in the case of daily dumps. In this case, the URLs of the files should first be made more standard to help people find these files: Bug 70385 and Bug 68792.


