Last modified: 2014-02-12 23:38:30 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T43790, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 41790 - Work on XML backup/export formats
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: WikidataRepo
Version: unspecified
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Wikidata bugs
Keywords: need-volunteer
Reported: 2012-11-05 18:07 UTC by Sam Reed (reedy)
Modified: 2014-02-12 23:38 UTC (History)


Description Sam Reed (reedy) 2012-11-05 18:07:18 UTC
So, I did a bit of poking around, looking at the database dumps and Special:Export for wikidatawiki.

For example, for Obama, we get something that starts like:

<text xml:space="preserve" bytes="8538">{&quot;label&quot;:{&quot;en&quot;:&quot;Barack Obama&quot;,&quot;fr&quot;:&quot;Barack Obama&quot;,&quot;ar&quot;:&quot;\u0628\u0627\u0631\u0627\u0643 \u0623\u0648\u0628\u0627\u0645\u0627&quot;,&quot;ru&quot;:&quot;\u0411\u0430\u0440\u0430\u043a \u041e\u0431\u0430\u043c\u0430&quot;,&quot;nb&quot;:&quot;Barack Obama&quot;,&quot;it&quot;:&quot;Barack Obama&quot;,&quot;de&quot;:&quot;Barack Obama&quot;,&quot;be-tarask&quot;:&quot;\u0411\u0430\u0440\u0430\u043a \u0410\u0431\u0430\u043c\u0430&quot;,&quot;nan&quot;:&quot;Barack Obama&quot;,&quot;ca&quot;:&quot;Barack Obama&quot;},&quot;description&quot;:{&quot;en&quot;:&quot;President of the United States of America


Full history for Q1-Q100 is currently 76.1MB. 7z turns that into 887KB

I'm going to have a poke around at some other larger exports done via shell.



I'm wondering whether there is a better way to represent this: we could override the backup handlers to produce a better backup format. Not high priority, but something to think about...
Comment 1 Daniel Kinzler 2012-11-05 19:22:38 UTC
Well, JSON in XML will need to have quotes escaped... I can think of two ways to make this less painful:

* use PHP serialization instead of JSON when generating XML. This only needs a small code change, since EntityHandler already supports PHP serialization. That sucks for portability, though - people want to process the dumps with Java, Python, etc.

* use CDATA to wrap the JSON, instead of quoting. Nice and easy, the question is just: how does the exporter know when to do this? Or should we always use CDATA? But this may confuse tools that use regular expressions to process dumps, instead of properly parsing XML. But then, I guess such code is broken by design.

So... other ideas?
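
To make the trade-off between quoting and CDATA concrete, here is a minimal Python sketch. The tiny `entity` dict is a made-up stand-in for a real item, and this is not how the exporter itself is written; it only shows that both forms parse back to the same JSON and that the CDATA form is shorter. It assumes the JSON does not contain the sequence ]]>.

```python
import json
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical miniature entity, standing in for a real Wikidata item.
entity = {"label": {"en": "Barack Obama", "fr": "Barack Obama"}}
payload = json.dumps(entity)

# Option A: entity-escape the JSON, quotes included, as in the dump above.
escaped = escape(payload, {'"': "&quot;"})

# Option B: wrap the raw JSON in a CDATA section.
# Assumes the payload does not contain "]]>".
cdata = "<![CDATA[" + payload + "]]>"


def text_of(body: str) -> str:
    """Parse a <text> element with the given body and return its text."""
    return ET.fromstring("<text>" + body + "</text>").text


print(len(escaped), len(cdata))
print(text_of(escaped) == text_of(cdata) == payload)
```

A real XML parser returns identical text for both variants, so only regex-based consumers would notice the difference.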
Comment 2 denny vrandecic 2012-11-07 18:15:43 UTC
What if we replace the " in the JSON with '?
Comment 3 denny vrandecic 2012-11-08 12:21:30 UTC
So, no, ' cannot be used instead of ". Stupid JSON spec.
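
A quick way to check this, sketched with Python's standard json module (the sample string is made up for illustration):

```python
import json

# JSON text that uses single quotes instead of double quotes.
single_quoted = "{'en': 'Barack Obama'}"

try:
    json.loads(single_quoted)
    single_quotes_ok = True
except json.JSONDecodeError:
    # A strict JSON parser rejects single-quoted strings and keys.
    single_quotes_ok = False

print(single_quotes_ok)
```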

PHP serialization: Denny says no.

CDATA is yucky too, but I am afraid it is probably the best way. :(

Any other ideas? Otherwise, we could go for CDATA.
Comment 4 Xavier Combelle 2013-02-25 15:37:17 UTC
A side problem is that Unicode characters don't need to be escaped per the JSON spec, so the snippet could be rewritten as:

<text xml:space="preserve" bytes="8538">{&quot;label&quot;:{&quot;en&quot;:&quot;Barack Obama&quot;,&quot;fr&quot;:&quot;Barack Obama&quot;,&quot;ar&quot;:&quot;باراك أوباما&quot;,&quot;ru&quot;:&quot;Барак Обама&quot;,&quot;nb&quot;:&quot;Barack Obama&quot;,&quot;it&quot;:&quot;Barack Obama&quot;,&quot;de&quot;:&quot;Barack Obama&quot;,&quot;be-tarask&quot;:&quot;Барак Абама&quot;,&quot;nan&quot;:&quot;Barack Obama&quot;,&quot;ca&quot;:&quot;Barack Obama&quot;},&quot;description&quot;:{&quot;en&quot;:&quot;President of the United States of America

which is about 1.2 times smaller in bytes
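
The effect of leaving Unicode unescaped can be sketched with Python's json module. The `labels` dict below just reuses two labels from the example, and `ensure_ascii=False` is Python's switch for this; it is an illustration, not the serializer the dumps actually use.

```python
import json

# Two of the non-ASCII labels from the example above.
labels = {"ru": "Барак Обама", "ar": "باراك أوباما"}

# Default: every non-ASCII character is written as a \uXXXX escape.
ascii_json = json.dumps(labels)

# ensure_ascii=False keeps the characters as they are.
utf8_json = json.dumps(labels, ensure_ascii=False)

# Compare sizes as UTF-8 bytes, which is what ends up in the dump file.
ascii_bytes = len(ascii_json.encode("utf-8"))
utf8_bytes = len(utf8_json.encode("utf-8"))

print(ascii_bytes, utf8_bytes)
```

Both strings decode to the same data; a Cyrillic or Arabic letter takes 2 bytes in UTF-8 against 6 bytes as a \uXXXX escape.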

but the big win is of course CDATA escaping

<text xml:space="preserve" bytes="8538"><![CDATA[{"label":{"en":"Barack Obama","fr":"Barack Obama","ar":"باراك أوباما","ru":"Барак Обама","nb":"Barack Obama","it":"Barack Obama","de":"Barack Obama","be-tarask":"Барак Абама","nan":"Barack Obama","ca":"Barack Obama"},"description":{"en":"President of the United States of America

which is half the size

If the CDATA approach is chosen, one should take care to transform ]]> sequences into ]]]]><![CDATA[> as explained here:
http://stackoverflow.com/questions/223652/is-there-a-way-to-escape-a-cdata-end-token-in-xml
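
A minimal Python sketch of that transformation, checked against a real XML parser. The helper name `wrap_cdata` and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any "]]>" across two CDATA sections."""
    # "]]>" would end the section early, so close the section after "]]"
    # and reopen a new one right before ">".
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


# Hypothetical JSON payload that happens to contain the CDATA end token.
tricky = '{"note": "a ]]> b"}'
wrapped = wrap_cdata(tricky)

# A parser joins the adjacent CDATA sections back into the original text.
round_tripped = ET.fromstring("<text>" + wrapped + "</text>").text

print(wrapped)
print(round_tripped == tricky)
```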
Comment 5 Xavier Combelle 2013-02-25 15:49:22 UTC
Another possibility for embedding the JSON is YAML (https://en.wikipedia.org/wiki/YAML), but tools for parsing it are less widespread.
