Last modified: 2013-06-04 11:30:00 UTC
Occasionally a revision in the XML dump will have a username of "0", but with a valid id. For example, here are two successive revisions of the en.wikipedia article "Anarchism", within about two hours of each other:

<revision>
  <id>120319</id>
  <parentid>120190</parentid>
  <timestamp>2002-07-18T12:17:37Z</timestamp>
  <contributor>
    <username>DanKeshet</username>
    <id>170</id>
  </contributor>
  <comment>remove material sent to libertarian socialism</comment>
  <sha1>4j52impt0d600i5d25035c1iw1osp9q</sha1>
  <text id="120319" bytes="10324" />
</revision>
<revision>
  <id>59361</id>
  <parentid>120319</parentid>
  <timestamp>2002-07-18T14:33:46Z</timestamp>
  <contributor>
    <username>0</username>
    <id>170</id>
  </contributor>
  <comment>*</comment>
  <sha1>a0m28uiexf1udd7pibg63tyt0qhreqw</sha1>
  <text id="59361" bytes="8604" />
</revision>

Obviously they're both by user "DanKeshet", whose user id is 170, but the second one has a username of "0" for some reason. They both display as being by "DanKeshet" if you look them up on the live wiki, though, I'm guessing because the live revision display is pulling by user_id?

This always seems to happen in conjunction with a revision_id that's obviously out of sequence with the revisions before and after it, so I'm guessing it's some kind of artifact of db shuffling.

Preferred behavior, at least as a user of the data, would be for the dumps to attribute the revision to the same username that the live view at http://en.wikipedia.org/w/index.php?title=Anarchism&oldid=59361 does. More generally, I'm curious what's causing this anomaly, and whether it indicates something else to watch out for when cleaning dump data.
The reason that the user name is '0' in the second revision is that in the database the user_text field for that row in the revision table is indeed 0. While a human looking at this could make the reasonable guess that the user who edited the second revision is the same as the one who edited the first one, it is possible for usernames to change, even during a given run. Thus, during export no guesses are made about attribution.
Fair enough. Any guess what led to these particular revisions having out-of-sequence revision IDs and a user_text of '0'? I know there is some weird stuff with old imported revisions, but the revisions showing this behavior seem to have been made around the same time as others which look 'normal'. I can work around this particular case, but I'm curious whether I'm missing something more general.
Gah, these are from 2002, who knows what the code was like then ("phase2").... we'll have to ask someone who's been around a lot longer for the answer to that.
Might as well be RESOLVED WONTFIX (against investigating this problem) *if* this problem hasn't also happened in more recent dumps.
The question really is: Are there revisions for which the user_text field for the revision is not zero but Export.php produces 0 for the username? I have not heard of any yet, so I would close this until someone reports one.
Closing is fine with me. I just scanned through the entire enwiki dump, and found no instances more recent than July 2002. So this looks like ancient database wonkiness rather than a bug in Export.php. An intriguing regularity, for whatever it's worth: the revision just *after* the wonky revision is always one that was made in a narrow date range around July 20-24, 2002. So I would guess there was something weird going on that week with the database or code. That's enough evidence, anyway, to convince myself that it was a one-off thing I can work around by just using the user id for those revisions.
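For anyone else cleaning dump data, here is a minimal sketch of that workaround in Python, not a definitive implementation. It assumes the dump is read as a stream of <revision> elements shaped like the two quoted earlier in this report, and that a username of "0" can be replaced by the last good username seen for the same contributor id. The function name resolve_usernames and the helper local are made up for illustration. Note that a "0" revision appearing before any good name for that id comes back as None, and a genuine account named "0" would be misread.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def resolve_usernames(source):
    """Yield (revision_id, user_id, username) for each <revision>.

    Assumption: a username of "0" is bogus and is replaced by the most
    recent non-"0" username seen for the same contributor id (or None
    if no such name has been seen yet).
    """
    known = {}  # contributor id -> last good username seen
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "revision":
            continue
        rev_id = user_id = username = None
        for child in elem:
            name = local(child.tag)
            if name == "id":
                rev_id = child.text
            elif name == "contributor":
                for field in child:
                    if local(field.tag) == "id":
                        user_id = field.text
                    elif local(field.tag) == "username":
                        username = field.text
        if username == "0" and user_id is not None:
            username = known.get(user_id)
        elif username is not None and user_id is not None:
            known[user_id] = username
        yield rev_id, user_id, username
        elem.clear()  # free the revision's children once it is processed


# The two revisions from this report, wrapped in a <page> element.
SAMPLE = b"""<page>
<revision><id>120319</id><parentid>120190</parentid>
<timestamp>2002-07-18T12:17:37Z</timestamp>
<contributor><username>DanKeshet</username><id>170</id></contributor>
<text id="120319" bytes="10324" /></revision>
<revision><id>59361</id><parentid>120319</parentid>
<timestamp>2002-07-18T14:33:46Z</timestamp>
<contributor><username>0</username><id>170</id></contributor>
<text id="59361" bytes="8604" /></revision>
</page>"""

for row in resolve_usernames(io.BytesIO(SAMPLE)):
    print(row)
```

Keying on the contributor id rather than guessing from neighbouring revisions keeps the fix local to the affected rows; anonymous edits, which carry no <id> or <username>, pass through with None for both.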
So why did I not ever wontfix this? Doing so now... At some point we might want to try 'fixing' those bad entries, but that is another bug entirely.