Last modified: 2014-02-12 23:38:30 UTC
So, I did a bit of poking around, looking at the database dumps and Special:Export for wikidatawiki. For example, for Obama, we get something that starts like:
<text xml:space="preserve" bytes="8538">{"label":{"en":"Barack Obama","fr":"Barack Obama","ar":"\u0628\u0627\u0631\u0627\u0643 \u0623\u0648\u0628\u0627\u0645\u0627","ru":"\u0411\u0430\u0440\u0430\u043a \u041e\u0431\u0430\u043c\u0430","nb":"Barack Obama","it":"Barack Obama","de":"Barack Obama","be-tarask":"\u0411\u0430\u0440\u0430\u043a \u0410\u0431\u0430\u043c\u0430","nan":"Barack Obama","ca":"Barack Obama"},"description":{"en":"President of the United States of America
The full history for Q1-Q100 is currently 76.1 MB; 7z turns that into 887 KB. I'm going to have a poke around at some other, larger exports done via shell. I'm just wondering whether there might be a better way to represent this: override the backup handlers and produce a better backup format. Not high priority, but something to think about...
Well, JSON in XML will need to have quotes escaped... I can think of two ways to make this less painful:
* Use PHP serialization instead of JSON when generating XML. This only needs a small code change, since EntityHandler already supports PHP serialization. That sucks for portability, though - people want to process the dumps with Java, Python, etc.
* Use CDATA to wrap the JSON, instead of quoting. Nice and easy; the question is just: how does the exporter know when to do this? Or should we always use CDATA? This may confuse tools that use regular expressions to process dumps instead of properly parsing XML. But then, I guess such code is broken by design.
So... other ideas?
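
To get a rough feel for the trade-off between entity-escaping and the CDATA option, here is a minimal Python sketch. The sample entity is made up for illustration (it is not taken from a real dump), and it assumes the exporter turns every double quote into `&quot;`; the payload is also assumed not to contain a `]]>` sequence.

```python
import json
from xml.sax.saxutils import escape

# Hypothetical, shortened entity; not real dump content.
entity = {"label": {"en": "Barack Obama", "fr": "Barack Obama"}}
payload = json.dumps(entity)

# Entity escaping: &, < and > are replaced, and (by assumption about the
# exporter) every double quote becomes the six-character entity &quot;.
escaped = escape(payload, {'"': "&quot;"})

# CDATA: the payload is left untouched and gets a fixed 12-character wrapper.
# This naive wrapping is only safe while the payload contains no "]]>".
cdata = "<![CDATA[" + payload + "]]>"

print("raw JSON:", len(payload))
print("escaped :", len(escaped))
print("CDATA   :", len(cdata))
```

The escaped form grows by five characters per double quote, while the CDATA form grows by a constant, so the gap widens with the size of the entity.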
What if we replace the " in the JSON with '?
So, no, ' cannot be used instead of ". Stupid JSON spec. PHP serialization: Denny says no. CDATA is yucky too, but I am afraid it is probably the best way. :( Any other ideas? Otherwise, we could go for CDATA.
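
The point about single quotes can be checked quickly with a strict JSON parser. A small sketch using Python's standard `json` module (the sample string is made up for illustration):

```python
import json

valid = '{"label": {"en": "Barack Obama"}}'
# Same document with every double quote swapped for a single quote.
single_quoted = valid.replace('"', "'")

print(json.loads(valid)["label"]["en"])

try:
    json.loads(single_quoted)
    accepted = True
except json.JSONDecodeError as err:
    # A strict parser rejects single-quoted strings and property names.
    accepted = False
    print("rejected:", err)
```

Some lenient parsers do accept single quotes, but consumers of the dumps could not rely on that.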
A side problem is that Unicode characters don't need to be escaped per the JSON spec, so the bit could be rewritten as:
<text xml:space="preserve" bytes="8538">{"label":{"en":"Barack Obama","fr":"Barack Obama","ar":"باراك أوباما","ru":"Барак Обама","nb":"Barack Obama","it":"Barack Obama","de":"Barack Obama","be-tarask":"Барак Абама","nan":"Barack Obama","ca":"Barack Obama"},"description":{"en":"President of the United States of America
which is about 1.2 times smaller in bytes. But the big win is of course CDATA escaping:
<text xml:space="preserve" bytes="8538"><![CDATA[{"label":{"en":"Barack Obama","fr":"Barack Obama","ar":"باراك أوباما","ru":"Барак Обама","nb":"Barack Obama","it":"Barack Obama","de":"Barack Obama","be-tarask":"Барак Абама","nan":"Barack Obama","ca":"Barack Obama"},"description":{"en":"President of the United States of America
which is half the size. If the CDATA way is chosen, one should take care to transform ]]> sequences into ]]]]><![CDATA[> as explained here: http://stackoverflow.com/questions/223652/is-there-a-way-to-escape-a-cdata-end-token-in-xml
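
Both suggestions (unescaped Unicode and CDATA with the `]]>` split) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `to_cdata` is a hypothetical helper name, the entity is invented, and the real exporter is PHP, so this shows the transformation rather than the actual code path.

```python
import json
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in a CDATA section, splitting any "]]>" across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Hypothetical entity containing non-ASCII text and a literal "]]>".
entity = {
    "label": {"ru": "Барак Обама"},
    "description": {"en": "tricky ]]> value"},
}

ascii_json = json.dumps(entity)                     # non-ASCII as \uXXXX escapes
utf8_json = json.dumps(entity, ensure_ascii=False)  # literal characters

print(len(ascii_json.encode("utf-8")), "bytes with \\u escapes")
print(len(utf8_json.encode("utf-8")), "bytes with literal UTF-8")

# Embed in a <text> element and check that an XML parser gives the JSON back.
xml_doc = '<text xml:space="preserve">' + to_cdata(utf8_json) + "</text>"
round_tripped = json.loads(ET.fromstring(xml_doc).text)
print(round_tripped == entity)
```

The parser concatenates the adjacent CDATA sections, so the consumer sees the original `]]>` again without any special handling.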
Another possibility for embedding JSON is YAML (https://en.wikipedia.org/wiki/YAML), but the tools for parsing it are less widespread.