Last modified: 2014-11-16 17:55:56 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T74678, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 72678 - json dumps have duplicate items (one for the redirect, one for the target)
Status: PATCH_TO_REVIEW
Product: MediaWiki extensions
Classification: Unclassified
Component: WikidataRepo
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Wikidata bugs
Whiteboard: u=dev c=backend p=0
Keywords:
Depends on:
Blocks:
Reported: 2014-10-29 15:32 UTC by Aude
Modified: 2014-11-16 17:55 UTC (History)
CC List: 6 users

Description Aude 2014-10-29 15:32:37 UTC
from project chat:

https://www.wikidata.org/wiki/Wikidata:Project_chat#JSON_dump_has_duplicates

I've been working with the JSON dumps and noticed that they contain identical duplicate entries. For example, in the latest dump [3], lines 921522 and 16155575 are identical dumps of the item Turi railway station (Q17100180). There are dozens of these duplicates. Should they be treated in a special way when processing the data dump? Jefft0 (talk) 01:17, 29 October 2014 (UTC)

:It looks like another item page [4] redirects to Turi railway station (Q17100180). I don't think the redirect should appear in the dump as a duplicate, so this seems like a bug. However, the redirect should probably be represented somewhere, in some form. Aude (talk) 07:18, 29 October 2014 (UTC)
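For anyone processing a dump that still contains these duplicates, the following is a minimal sketch of how they could be detected while reading line by line. It assumes the dump is a JSON array with one entity per line and a trailing comma after each entity, and that every entity has an "id" key; the function name find_duplicate_ids and the sample lines are made up for illustration.

```python
import json
from collections import defaultdict


def find_duplicate_ids(lines):
    """Return {entity_id: [line_numbers]} for IDs that occur on more than one line."""
    seen = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        text = line.strip().rstrip(",")
        if text in ("", "[", "]"):
            continue  # array delimiters and blank lines carry no entity
        entity = json.loads(text)
        seen[entity["id"]].append(line_number)
    return {eid: nums for eid, nums in seen.items() if len(nums) > 1}


# Tiny stand-in for a dump file (assumed layout, not real dump content).
sample = [
    "[",
    '{"id": "Q17100180", "type": "item"},',
    '{"id": "Q42", "type": "item"},',
    '{"id": "Q17100180", "type": "item"}',
    "]",
]
duplicates = find_duplicate_ids(sample)
print(duplicates)
```

On a real dump the same function can be given an open file object instead of a list, since it only iterates over lines.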
Comment 1 Marius Hoch 2014-11-01 01:24:17 UTC
I wonder whether we want to include information about redirects in the dump or simply leave redirects out. Leaving them out would be easier to implement and would not change the schema (and thus would not break backward compatibility), so I'd suggest going down that road.
Comment 2 Jeff Thompson 2014-11-03 16:01:53 UTC
(In reply to Marius Hoch from comment #1)
> I wonder whether we want to include information about redirects in there or
> simply leave redirects out? Just leaving them out will be easier to
> implement and not change the schema (thus not breaking b/c), so I'd suggest
> going down that road.

An item may have a property value which is a redirected item.  Does the JSON dump "resolve" the item value and dump the item ID of the redirect target?  If yes, then the JSON dump can leave out redirect information.  But if the JSON dump does not resolve a redirect, then there must be a dump for the redirects (maybe in the JSON dump or another file).
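If the dump does not resolve redirects itself, a consumer would need a redirect mapping from somewhere else to do it. A minimal sketch of that client-side step follows; the mapping, the ID Q100, and the function name resolve are hypothetical and only illustrate the idea.

```python
def resolve(entity_id, redirects, max_hops=10):
    """Follow redirects until an ID that is not itself a redirect is reached."""
    hops = 0
    while entity_id in redirects:
        if hops >= max_hops:
            # Guards against cycles or unexpectedly long chains.
            raise ValueError(f"redirect chain too long at {entity_id}")
        entity_id = redirects[entity_id]
        hops += 1
    return entity_id


# Hypothetical mapping: redirected item ID -> target item ID.
redirects = {"Q100": "Q17100180"}

# Item IDs as they might appear as property values in the dump.
values = ["Q100", "Q42"]
resolved = [resolve(v, redirects) for v in values]
print(resolved)
```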
Comment 3 Ori Livneh 2014-11-07 19:02:31 UTC
See RFC 6901 (JSON Pointer): http://tools.ietf.org/html/rfc6901
Comment 4 Marius Hoch 2014-11-07 19:09:48 UTC
(In reply to Ori Livneh from comment #3)
> See: http://tools.ietf.org/html/rfc6901

<hoo> thinking about it: That would horribly not work, although it might be semantically nice
<hoo> our JSON has close to 40gb now (or so), thus people usually read it line by line and not as a one json

Maybe we want a simple { id: "Q123", redirect: "Q1" } or something like this? People only care (if at all) about the ID of the redirected entity and the target of the redirect.
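A rough sketch of how a line-by-line reader might handle such stub records, had this format been adopted. The "redirect" key is taken from the suggestion above; strict JSON would need the keys quoted, and the one-entity-per-line layout with trailing commas is an assumption. The function name split_records is made up for illustration.

```python
import json


def split_records(lines):
    """Separate full entities from redirect stubs while reading line by line."""
    entities = []
    redirects = {}
    for line in lines:
        text = line.strip().rstrip(",")
        if text in ("", "[", "]"):
            continue  # skip array delimiters and blank lines
        record = json.loads(text)
        if "redirect" in record:
            redirects[record["id"]] = record["redirect"]
        else:
            entities.append(record)
    return entities, redirects


# Stand-in dump lines using the suggested stub format with quoted keys.
sample = [
    "[",
    '{"id": "Q1", "type": "item"},',
    '{"id": "Q123", "redirect": "Q1"}',
    "]",
]
entities, redirects = split_records(sample)
print(len(entities), redirects)
```

Because each line is parsed on its own, the whole file never has to be loaded as a single JSON document.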
Comment 5 Marius Hoch 2014-11-15 16:50:59 UTC
Ok, we just talked about this and decided to simply leave redirects out of the dump for now.

Having a separate JSON dump with redirect information might be doable, though, if there is a desire for it (please open a new bug in that case).
Comment 6 Gerrit Notification Bot 2014-11-16 17:55:53 UTC
Change 173664 had a related patch set uploaded by Hoo man:
Don't include redirects in json dumps

https://gerrit.wikimedia.org/r/173664
