Last modified: 2014-11-16 17:55:56 UTC
From project chat: https://www.wikidata.org/wiki/Wikidata:Project_chat#JSON_dump_has_duplicates

I've been working with the JSON dumps and notice that they contain identical duplicate entries. For example, in the latest dump [3], line numbers 921522 and 16155575 are identical dumps of item Turi railway station (Q17100180). There are dozens of these duplicates. Should these be treated in a special way when processing the data dump? Jefft0 (talk) 01:17, 29 October 2014 (UTC)

:It looks like another item page [4] redirects to Turi railway station (Q17100180). I don't think the redirect should be in the dump as a duplicate, so this seems like a bug. But the redirect probably should be represented somewhere and in some form. Aude (talk) 07:18, 29 October 2014 (UTC)
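Until the dump itself is fixed, a consumer could skip the repeated entries while reading. The following is only a minimal sketch, assuming the dump is a JSON array with one entity per line (each line ending in a comma) and that every entity carries an "id" field; the helper names `iter_entities` and `drop_duplicates` are made up for illustration.

```python
import json


def iter_entities(lines):
    """Yield one decoded entity per dump line, skipping the array brackets."""
    for line in lines:
        # Assumption: one entity per line, with a trailing comma on all but the last.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def drop_duplicates(entities):
    """Yield each entity once, keeping the first occurrence of every ID."""
    seen = set()
    for entity in entities:
        entity_id = entity["id"]
        if entity_id in seen:
            continue  # a later copy, e.g. one emitted for a redirect
        seen.add(entity_id)
        yield entity


# Tiny stand-in for the real dump file.
sample = [
    "[",
    '{"id": "Q17100180", "type": "item"},',
    '{"id": "Q42", "type": "item"},',
    '{"id": "Q17100180", "type": "item"}',
    "]",
]
unique_ids = [e["id"] for e in drop_duplicates(iter_entities(sample))]
```

For the real file, `lines` would be the open file object, so the dump is never loaded into memory as a whole; only the set of IDs seen so far is kept.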
I wonder whether we want to include information about redirects in there or simply leave redirects out? Just leaving them out will be easier to implement and not change the schema (thus not breaking b/c), so I'd suggest going down that road.
(In reply to Marius Hoch from comment #1)
> I wonder whether we want to include information about redirects in there or
> simply leave redirects out? Just leaving them out will be easier to
> implement and not change the schema (thus not breaking b/c), so I'd suggest
> going down that road.

An item may have a property value which is a redirected item. Does the JSON dump "resolve" the item value and dump the item ID of the redirect target? If yes, then the JSON dump can leave out redirect information. But if the JSON dump does not resolve a redirect, then there must be a dump for the redirects (maybe in the JSON dump or another file).
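If the dump does not resolve redirects, a consumer would have to do it. As a rough sketch, assuming the redirect information is available from somewhere as a plain mapping from redirected ID to target ID (the `redirects` dict and the `resolve` helper below are hypothetical, not part of any dump format):

```python
def resolve(entity_id, redirects):
    """Follow redirects until reaching an ID that is not itself redirected."""
    visited = set()
    while entity_id in redirects:
        if entity_id in visited:
            # Guard against malformed data; redirects should not form a loop.
            raise ValueError(f"redirect cycle at {entity_id}")
        visited.add(entity_id)
        entity_id = redirects[entity_id]
    return entity_id


# Hypothetical data: Q500 redirects to Q123, which redirects to Q1.
redirects = {"Q123": "Q1", "Q500": "Q123"}
target = resolve("Q500", redirects)
```

An ID that is not in the mapping is returned unchanged, so the function can be applied to every item value without checking first.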
See: http://tools.ietf.org/html/rfc6901
(In reply to Ori Livneh from comment #3)
> See: http://tools.ietf.org/html/rfc6901

<hoo> thinking about it: That would horribly not work, although it might be semantically nice
<hoo> our JSON has close to 40gb now (or so), thus people usually read it line by line and not as a one json

Maybe we want a simple { id: "Q123", redirect: "Q1" } or something like this? People only care (if at all) about the id of the redirected entity and the target of the redirect.
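To illustrate how such records could be consumed line by line, here is a sketch only. It assumes the proposed record would be written as valid JSON with quoted keys, one record per line, mixed in with ordinary entities; neither the format nor the `load_redirects` helper exists in the dump.

```python
import json


def load_redirects(lines):
    """Collect {redirected id: target id} from line-per-record JSON."""
    redirects = {}
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        record = json.loads(line)
        # Assumption: only redirect records carry a "redirect" key.
        if "redirect" in record:
            redirects[record["id"]] = record["redirect"]
    return redirects


sample = [
    '{"id": "Q123", "redirect": "Q1"},',
    '{"id": "Q1", "labels": {}}',
]
redirect_map = load_redirects(sample)
```

Because each record is decoded on its own, this works without parsing the whole file as one JSON document, which is the constraint raised above.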
Ok, we just talked about this and decided to simply leave redirects out of the dump for now. Having a separate json dump with redirect information might be doable though, if there is desire to have this (please open a new bug then).
Change 173664 had a related patch set uploaded by Hoo man: Don't include redirects in json dumps https://gerrit.wikimedia.org/r/173664