Last modified: 2013-06-04 11:30:00 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T46558, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 44558 - spurious username "0" in some dump revisions
Status: RESOLVED WONTFIX
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Low normal
Target Milestone: ---
Assigned To: Ariel T. Glenn
Reported: 2013-01-31 18:02 UTC by Mark Nelson
Modified: 2013-06-04 11:30 UTC


Description Mark Nelson 2013-01-31 18:02:29 UTC
Occasionally a revision in the XML dump will have a username of "0", but with a valid id.

For example, here are two successive revisions of the en.wikipedia article "Anarchism", within about two hours of each other:

    <revision>
      <id>120319</id>
      <parentid>120190</parentid>
      <timestamp>2002-07-18T12:17:37Z</timestamp>
      <contributor>
        <username>DanKeshet</username>
        <id>170</id>
      </contributor>
      <comment>remove material sent to libertarian socialism</comment>
      <sha1>4j52impt0d600i5d25035c1iw1osp9q</sha1>
      <text id="120319" bytes="10324" />
    </revision>
    <revision>
      <id>59361</id>
      <parentid>120319</parentid>
      <timestamp>2002-07-18T14:33:46Z</timestamp>
      <contributor>
        <username>0</username>
        <id>170</id>
      </contributor>
      <comment>*</comment>
      <sha1>a0m28uiexf1udd7pibg63tyt0qhreqw</sha1>
      <text id="59361" bytes="8604" />
    </revision>

Obviously they're both by user "DanKeshet", whose user id is 170, but the second one has a username of "0" for some reason. They both display as being by "DanKeshet" if you look them up on the live wiki, though, I'm guessing because the live revision display is pulling by user_id? This always seems to happen in conjunction with a revision_id that's obviously out of sequence with the revisions before and after it, so I'm guessing some kind of artifact of db shuffling.

Preferred behavior, at least as a user of the data, would be for the dumps to attribute the revision to the same username that the live view at http://en.wikipedia.org/w/index.php?title=Anarchism&oldid=59361 does. More generally, I'm curious what's causing this anomaly, and if it indicates something else to watch out for, when cleaning dump data.
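
For anyone cleaning dump data, here is a minimal Python sketch of how such revisions might be flagged. It assumes the dump follows the `<revision>`/`<contributor>` layout shown above; the function name `find_suspect_revisions` and the inline `SAMPLE` are illustrative only and not part of any dump tooling. It also checks whether the revision id is lower than its parent id, which matches the out-of-sequence pattern described here but is only a heuristic.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}revision'."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Return the text of the first direct child called `name`, or None."""
    for child in elem:
        if local(child.tag) == name:
            return child.text
    return None


def find_suspect_revisions(source):
    """Yield (rev_id, parent_id, user_id, out_of_sequence) for each
    revision whose contributor username is the literal string "0"."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "revision":
            continue
        contributor = next(
            (c for c in elem if local(c.tag) == "contributor"), None
        )
        if contributor is not None and child_text(contributor, "username") == "0":
            rev_id = int(child_text(elem, "id"))
            parent = child_text(elem, "parentid")
            parent_id = int(parent) if parent else None
            # Heuristic only: an id lower than its parent's id is "out of sequence".
            out_of_sequence = parent_id is not None and rev_id < parent_id
            yield rev_id, parent_id, child_text(contributor, "id"), out_of_sequence
        elem.clear()  # release the revision's children once it has been inspected


# Illustrative input: the two revisions quoted above, trimmed to the relevant fields.
SAMPLE = b"""<page>
  <revision>
    <id>120319</id>
    <parentid>120190</parentid>
    <contributor><username>DanKeshet</username><id>170</id></contributor>
  </revision>
  <revision>
    <id>59361</id>
    <parentid>120319</parentid>
    <contributor><username>0</username><id>170</id></contributor>
  </revision>
</page>"""

for hit in find_suspect_revisions(io.BytesIO(SAMPLE)):
    print(hit)
```

On a real dump the same function could be pointed at an open (decompressed) file object instead of `SAMPLE`. A legitimately registered account named "0" would also be flagged, so any hits still need a manual look.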
Comment 1 Ariel T. Glenn 2013-02-04 12:37:28 UTC
The reason that the user name is '0' in the second revision is that in the database the user_text field for that row in the revision table is indeed 0.

While a human looking at this could make the reasonable guess that the user who edited the second revision is the same as the one who edited the first one, it is possible for usernames to change, even during a given run.  Thus, during export no guesses are made about attribution.
Comment 2 Mark Nelson 2013-02-05 11:25:37 UTC
Fair enough. Any guess what led to these particular revisions having out-of-sequence revision IDs, and user_text of '0'? I know there is some weird stuff with old imported revisions, but the revisions showing this behavior seem to have been made around the same time as others which look 'normal'. I can work around this particular case, but I'm curious if I'm missing something more general.
Comment 3 Ariel T. Glenn 2013-02-05 12:48:42 UTC
Gah, these are from 2002, who knows what the code was like then ("phase2").... we'll have to ask someone who's been around a lot longer for the answer to that.
Comment 4 Andre Klapper 2013-02-05 12:50:50 UTC
Might as well be RESOLVED WONTFIX (i.e. not investigating this problem) *if* this problem hasn't also happened in more recent dumps.
Comment 5 Ariel T. Glenn 2013-02-05 12:56:31 UTC
The question really is:

Are there revisions for which the user_text field for the revision is not zero but Export.php produces 0 for the username?  I have not heard of any yet, so I would close this until someone reports one.
Comment 6 Mark Nelson 2013-02-05 15:20:50 UTC
Closing is fine with me. I just scanned through the entire enwiki dump, and found no instances more recent than July 2002. So this looks like ancient database wonkiness rather than a bug in Export.php.

An intriguing regularity, for whatever it's worth: the revision just *after* the wonky revision is always one that was made in a narrow date range around July 20-24, 2002. So I would guess there was something weird going on that week with the database or code. That's enough evidence, anyway, to convince myself that it was a one-off thing I can work around by just using the user id for those revisions.
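
The workaround of relying on the user id could be sketched roughly as follows. This is only an illustration under stated assumptions: `contributor_key` is a hypothetical helper, ids and usernames are taken as the strings read from the dump, and an id of "0" or a missing id is assumed to mean no registered account is attached.

```python
from typing import Optional, Tuple


def contributor_key(username: Optional[str], user_id: Optional[str]) -> Tuple[str, str]:
    """Return a grouping key for a revision's contributor.

    Prefers the numeric user id whenever one is present, so that a
    revision exported with username "0" still groups with the other
    revisions by the same account. Falls back to the username otherwise.
    """
    if user_id and user_id != "0":  # assumption: "0" or empty means no account id
        return ("id", user_id)
    if username:
        return ("name", username)
    return ("unknown", "")


# The two revisions from the description end up under the same key.
print(contributor_key("DanKeshet", "170") == contributor_key("0", "170"))
```

Keying on the id rather than guessing a name also sidesteps the point made in comment 1 that usernames can change over time.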
Comment 7 Ariel T. Glenn 2013-06-04 11:30:00 UTC
So why did I not ever wontfix this?  Doing so now...  At some point we might want to try 'fixing' those bad entries but that is another bug entirely.
