Last modified: 2014-04-01 19:28:55 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T48014, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 46014 - inconsistent revision id and html content
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: wmf-deployment
Hardware: All All
Importance: High normal with 2 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2013-03-12 05:48 UTC by anthonyzhang
Modified: 2014-04-01 19:28 UTC (History)
18 users (show)



Attachments
the html crawled from wikipedia page http://en.wikipedia.org/wiki/Netherlands (571.25 KB, text/html)
2013-03-12 05:48 UTC, anthonyzhang
stale cache of Abdullah of Saudi Arabia on Aug. 21 (336.11 KB, text/plain)
2013-08-23 02:04 UTC, bianjiang
title=St._Louis_Cardinals, oldid=579097247 (532.89 KB, text/html)
2013-11-19 06:10 UTC, bianjiang
stale cache of title=St._Louis_Cardinals, oldid=579097280 (540.71 KB, text/html)
2013-11-19 06:14 UTC, bianjiang
de.wikipedia.org/w/index.php?title=Albanien&oldid=125280145 (406.50 KB, application/octet-stream)
2013-12-11 04:01 UTC, bianjiang
de.wikipedia.org/w/index.php?title=Albanien&oldid=125280167 (420.93 KB, application/octet-stream)
2013-12-11 04:02 UTC, bianjiang
stale cache of title= Roslindale, oldid=583299083 (24.69 KB, text/html)
2014-02-12 14:00 UTC, bianjiang
stale cache of title=Memorial_to_the_Murdered_Jews_of_Europe, oldid=599805302 (50.20 KB, text/html)
2014-03-27 17:53 UTC, anthonyzhang

Description anthonyzhang 2013-03-12 05:48:18 UTC
Created attachment 11914 [details]
the html crawled from wikipedia page http://en.wikipedia.org/wiki/Netherlands

When we crawled the Wikipedia page http://en.wikipedia.org/wiki/Netherlands, the HTML in the response contained the following:

<div class="printfooter"> Retrieved from "<a href="http://en.wikipedia.org/w/index.php?title=Netherlands&amp;oldid=543458973">

So this should be the HTML content of revision 543458973. But it also contains the text "Netherland people are also homosexual.", which belongs to the previous revision, 543458897. The revision ID and the content are inconsistent.

The live page has since been fixed. I have put the snapshot we captured in the attached file. Please take a look.

Thanks!
Comment 1 Andre Klapper 2013-03-12 10:19:22 UTC
How did you crawl that page? Please provide steps to reproduce.

[Not a bug report about Bugzilla itself, hence moving]
Comment 2 anthonyzhang 2013-03-12 10:32:37 UTC
We sent standard HTTP request with URL "http://en.wikipedia.org/wiki/Netherlands" to Wikipedia's server and got the returned HTTP response, then stored it to the attached file.

I can't reproduce the same HTTP response now, the Netherlands page is completely updated now. Is this kind of inconsistency possible? Is the HTTP response I got unexpected?
Comment 3 Andre Klapper 2013-03-12 10:48:23 UTC
(In reply to comment #2)
> We sent standard HTTP request with URL
> "http://en.wikipedia.org/wiki/Netherlands" to Wikipedia's server and got the
> returned HTTP response, then stored it to the attached file.

With which tool? With which command? This is a bit vague so far.
Comment 4 anthonyzhang 2013-03-13 04:49:08 UTC
I used a Google internal library written in C++. It generated an HTTP/1.1 request. It is similar to the command "wget http://en.wikipedia.org/wiki/Netherlands".

What's the difference between different tools? Why does it matter?

When MediaWiki generates the HTML content of one page, it should use the matched revision id and wikitext, right?
Comment 5 anthonyzhang 2013-03-20 06:56:18 UTC
gentle ping.
Comment 6 Andre Klapper 2013-03-20 12:55:55 UTC
(In reply to comment #2)
> I can't reproduce the same HTTP response now, the Netherlands page is
> completely updated now.

It would be great to know at which exact time you ran your query to compare it with the timestamps of the changes on the wikipage. It might either have been "bad timing" or that some caching servers were not updated yet, in this case mw1039.

(In reply to comment #4)
> What's the difference between different tools? Why does it matter?

It's helpful to know any parameters when somebody tries to reproduce your problem.
Comment 7 Zark Khullah 2013-08-21 05:11:35 UTC
I have seen the same inconsistency with the article http://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspondence


Reading the article, it shows this text:

"In programming language theory and proof theory, the Curry–Howard correspondence (also known as the Curry–Howard isomorphism or equivalence, or the proofs-as-programs and propositions- or formulae-as-types interpretation) is the direct relationship between computer programs and mathematical proofs. It is a generalization of a syntactic analogy between systems of formal logic and computational calculi.

It is the link between Logic and Computation that is usually attributed to H. B. Curry and W. A. Howard, although the idea is related to the operational interpretation of intuitionistic logic given in various formulations by Brouwer, Heyting and Kolmogorov.

Origin, scope, and consequences (...)"


When I try to edit, the code shows (in visual edit OR source edit):

"In programming language theory and proof theory, the Curry–Howard correspondence (also known as the Curry–Howard isomorphism or equivalence, or the proofs-as-programs and propositions- or formulae-as-types interpretation) is the direct relationship between computer programs and mathematical proofs. It is a generalization of a syntactic analogy between systems of formal logic and computational calculi that was first discovered by the American mathematician Haskell Curry and logician William Alvin Howard.[citation needed]

Origin, scope, and consequences (...)"


Comparing the two last edits, it can be seen that the text shown is not from the latest edit, but the previous one.

Revision as of 08:01, 18 August 2013 (oldid=569063806)
Latest revision as of 08:01, 18 August 2013 (oldid=569063819)


Hope this helps confirm the reported bug.
Comment 8 Chris Steipp 2013-08-21 17:17:04 UTC
That is interesting. Adding some other people who might know why that is happening.
Comment 9 bianjiang 2013-08-23 01:53:59 UTC
We keep seeing this issue. One more case from Aug. 21:

We were fetching revid=569587966, which is a fix after vandalism:
http://en.wikipedia.org/w/index.php?title=Abdullah_of_Saudi_Arabia&oldid=569587966
But the content still contains the bad edit from revid=569587944:
http://en.wikipedia.org/w/index.php?title=Abdullah_of_Saudi_Arabia&oldid=569587944

The HTML (rev=569587966) we got looks like: (details in attachment)
HTTP/1.0 200 OK
X-Content-Type-Options: nosniff
Content-Language: en
Last-Modified: Wed, 21 Aug 2013 15:53:30 GMT
Content-Encoding: gzip
Content-Length: 69500
Content-Type: text/html; charset=UTF-8
Date: Wed, 21 Aug 2013 15:54:05 GMT
Server: Apache
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Vary: Accept-Encoding,Cookie
Age: 124
X-Cache: HIT from cp1017.eqiad.wmnet
X-Cache-Lookup: HIT from cp1017.eqiad.wmnet:3128
X-Cache: MISS from cp1010.eqiad.wmnet
X-Cache-Lookup: MISS from cp1010.eqiad.wmnet:80
Connection: keep-alive

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">

......

mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Abdullah_of_Saudi_Arabia","wgTitle":"Abdullah of Saudi Arabia","wgCurRevisionId":569587966,"wgArticleId":19186951,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages using citations with accessdate and no URL","All articles with dead external links","Articles with dead external links from August 2013","Wikipedia indefinitely move-protected pages","Use dmy dates from August 2013",

......

<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr"><p><b>Hey King OF ISLAMIC World, Hey King OF SAUDI ARABIA. Please Dont Support to Terrorist Militaray of Egypt. They are killing Muslims out of law and out of Democracy. I know you know everything But You are not Muslim.Thats why You are support to kill Muslims. I humbly request to people of saudi arabia please show the straight way to the YOUR'S king.</b></p>
<table class="infobox vcard" style="font-size: 88%; text-align: left; width: 22em">

......

<noscript><img src="//en.wikipedia.org/w/index.php?title=Special:CentralAutoLogin/start&amp;type=1x1&amp;from=enwiki" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript></div> <div class="printfooter">
Retrieved from "<a href="http://en.wikipedia.org/w/index.php?title=Abdullah_of_Saudi_Arabia&amp;oldid=569587966">http://en.wikipedia.org/w/index.php?title=Abdullah_of_Saudi_Arabia&amp;oldid=569587966</a>" </div>
<div id='catlinks' class='catlinks'><div id="mw-normal-catlinks" class="mw-no

......

        <li id="coll-create_a_book"><a href="/w/index.php?title=Special:Book&amp;bookcmd=book_creator&amp;referer=Abdullah+of+Saudi+Arabia">Create a book</a></li>
        <li id="coll-download-as-rl"><a href="/w/index.php?title=Special:Book&amp;bookcmd=render_article&amp;arttitle=Abdullah+of+Saudi+Arabia&amp;oldid=569587966&amp;writer=rl">Download as PDF</a></li>
        <li id="t-print"><a href="/w/index.php?title=Abdullah_of_Saudi_Arabia&amp;printable=yes" title="Printable version of this page [p]" accesskey="p">Printable version</a></li>
</ul>

......
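
[Editor's sketch] The check described in this comment (a page that claims one revision but still carries text from an earlier one) could be automated roughly as below. This is only a minimal sketch: the two regular expressions are assumptions based on the markup quoted above, and the helper names `claimed_revision_ids` and `is_stale` are hypothetical, not part of MediaWiki or of the crawler used in this report.

```python
import re

# Both patterns are assumptions derived from the HTML quoted in this report.
FOOTER_OLDID = re.compile(r'<div class="printfooter">.*?oldid=(\d+)', re.DOTALL)
CONFIG_REVID = re.compile(r'"wgCurRevisionId":\s*(\d+)')


def claimed_revision_ids(html):
    """Return (footer_oldid, config_revid); either is None if not found."""
    footer = FOOTER_OLDID.search(html)
    config = CONFIG_REVID.search(html)
    return (
        int(footer.group(1)) if footer else None,
        int(config.group(1)) if config else None,
    )


def is_stale(html, expected_rev_id, removed_text):
    """True if the page claims expected_rev_id but still contains text
    that the caller knows was removed by that revision."""
    claims_expected = expected_rev_id in claimed_revision_ids(html)
    return claims_expected and removed_text in html


sample = (
    'mw.config.set({"wgCurRevisionId":569587966,"wgArticleId":19186951});'
    '<div id="mw-content-text"><p><b>REMOVED VANDALISM</b></p></div>'
    '<div class="printfooter"> Retrieved from "<a href="http://en.wikipedia.org'
    '/w/index.php?title=Abdullah_of_Saudi_Arabia&amp;oldid=569587966"></a>" </div>'
)
print(claimed_revision_ids(sample))
print(is_stale(sample, 569587966, "REMOVED VANDALISM"))
```

The caller has to supply the text that the newer revision removed, since the page alone cannot tell which wikitext it was rendered from.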
Comment 10 bianjiang 2013-08-23 02:04:01 UTC
Created attachment 13154 [details]
stale cache of Abdullah of Saudi Arabia on Aug. 21

The header part of this attachment is the HTTP response headers. The following is the HTML we get.
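
[Editor's sketch] For anyone inspecting an attachment laid out this way (response headers first, then the HTML), the following is a rough sketch of how the two parts could be separated and the cache-related headers summarized. The function names are hypothetical, and it assumes the saved file is plain text with an empty line between headers and body, as in the excerpt in comment 9.

```python
def split_response(raw):
    """Split a saved HTTP response into (status_line, headers, body).

    headers is a list of (lowercased_name, value) pairs, so repeated
    headers such as X-Cache are all preserved.
    """
    text = raw.replace("\r\n", "\n")          # accept CRLF or LF line endings
    head, _, body = text.partition("\n\n")    # first empty line ends the headers
    lines = head.split("\n")
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip().lower(), value.strip()))
    return lines[0], headers, body


def cache_summary(headers):
    """Collect the X-Cache results and the Age value (seconds), if present."""
    x_cache = [value for name, value in headers if name == "x-cache"]
    ages = [int(value) for name, value in headers if name == "age"]
    return {
        "x_cache": x_cache,
        "age": ages[0] if ages else None,
        "served_from_cache": any(v.upper().startswith("HIT") for v in x_cache),
    }


raw = (
    "HTTP/1.0 200 OK\r\n"
    "Age: 124\r\n"
    "X-Cache: HIT from cp1017.eqiad.wmnet\r\n"
    "X-Cache-Lookup: HIT from cp1017.eqiad.wmnet:3128\r\n"
    "X-Cache: MISS from cp1010.eqiad.wmnet\r\n"
    "\r\n"
    "<!DOCTYPE html>\n"
)
status, headers, body = split_response(raw)
summary = cache_summary(headers)
print(status, summary)
```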
Comment 11 bianjiang 2013-08-29 13:41:13 UTC
The revision was created at 1377593916 (Tue Aug 27 08:58:36 UTC 2013), and we fetched it at 1377596440 (Tue Aug 27 09:40:40 UTC 2013), roughly 40 minutes later.

But we still get a stale cache:

The HTML contains both content from the claimed revision "570370203"

"""
<div class="printfooter">
                                Retrieved from "<a href="http://en.wikipedia.org/w/index.php?title=Russell_Crowe&amp;oldid=570370203">http://en.wikipedia.org/w/index.php?title=Russell_Crowe&amp;oldid=570370203</a>"                          </div>
"""


and stale content from the previous revision "570370195"

"""
is a ball sack loving faggot who I actually enjoyed in
"""
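
[Editor's sketch] The two Unix timestamps given at the start of this comment can be double-checked with a few lines of Python. Nothing here is assumed beyond the two numbers from the comment.

```python
from datetime import datetime, timezone

# Timestamps copied from the comment above.
revision_created = datetime.fromtimestamp(1377593916, tz=timezone.utc)
page_fetched = datetime.fromtimestamp(1377596440, tz=timezone.utc)
delay = page_fetched - revision_created

print(revision_created.isoformat())   # 2013-08-27T08:58:36+00:00
print(page_fetched.isoformat())       # 2013-08-27T09:40:40+00:00
print(int(delay.total_seconds()))     # 2524 seconds, i.e. about 42 minutes
```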
Comment 12 Chris Steipp 2013-08-29 16:01:02 UTC
Bryan/Brian, since you guys have been working on this for images, do you have thoughts about how the cache is purged when revisions are rolled back or deleted? Google is finding that the removed html is often still returned. Any ideas would help.
Comment 13 Bawolff (Brian Wolff) 2013-08-29 16:56:09 UTC
Re Zark Khullah in comment 7: were you logged in or logged out when this happened?

To clarify, is it only Google getting the old versions, or also logged-out users who have cleared all their cookies? Or do logged-in users sometimes get the old version too?

Additionally, does this happen only for edits that were reverted within roughly a minute of being made (i.e. ClueBot_NG reverts)? There may be some sort of race condition when a revert follows the original edit so closely.

[I checked cache purging in general, and it is working on the pages I tested, so it's not a site-wide outage of cache purging.]
Comment 14 Chris Steipp 2013-08-29 17:37:03 UTC
Bawolff: it's logged-out users (ops asked them to crawl anonymously, so they hit the cache). I haven't looked into how the reverts were made, whether by ClueBot or manually, but that could be an issue. All the examples have been very blatant stuff that ClueBot could have picked up.
Comment 15 Gabriel Wicke 2013-08-30 03:04:46 UTC
We discussed this in the office a few days ago. The updated revision ID in the HTML means that the front-end cache was properly purged. 

Some possible issues to check:

* Bad parser cache validation.

* PHP reading some information from MySQL master (the revision id used for the footer) while using lagged slaves for other info (the revision id used for parser cache validation / parsing).

Since this is relatively rare it could be a race condition, where an anon request happens shortly after an update. It might help to correlate edit timestamps with render timestamps in the bad HTML and slave lag at the time.
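
[Editor's sketch] The second hypothesis above (footer revision ID read from the master, page text read from a lagged slave) can be illustrated with a toy model. This is not MediaWiki code and does not claim to be the confirmed cause. The `Database` class and `render` function are hypothetical stand-ins, and the revision IDs are borrowed from the Netherlands example in the description.

```python
class Database:
    """Toy stand-in for one DB server: latest revision of a single page."""

    def __init__(self, rev_id, text):
        self.rev_id = rev_id
        self.text = text


def render(master, replica):
    """Hypothetical renderer that mixes sources: body text comes from the
    replica, while the footer revision ID comes from the master."""
    body = "<p>%s</p>" % replica.text
    footer = '<div class="printfooter">oldid=%d</div>' % master.rev_id
    return body + footer


master = Database(543458897, "vandalized text")
replica = Database(543458897, "vandalized text")

# A revert is written to the master; the replica has not caught up yet.
master.rev_id, master.text = 543458973, "clean text"
stale_page = render(master, replica)   # new ID in footer, old text in body

# After replication catches up, the same request is consistent again.
replica.rev_id, replica.text = master.rev_id, master.text
fresh_page = render(master, replica)

print(stale_page)
print(fresh_page)
```

If a page like `stale_page` is then stored in a cache, it would keep being served after the lag is gone, which would match the long delays reported in this thread.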
Comment 16 bianjiang 2013-09-02 14:00:26 UTC
Hi,
Any update on this?
It looks like the bug always appears when ClueBot reverts vandalism (though not every revert triggers it). We just noticed one more at http://en.wikipedia.org/w/index.php?title=Barbra_Streisand&oldid=571125240

The symptoms are similar to the case I reported before.
Comment 17 Greg Grossmeier 2013-09-03 21:45:52 UTC
Assigning to Aaron for now to do some deeper digging here.
Comment 18 Derk-Jan Hartman 2013-09-06 09:37:25 UTC
We had a few similar reports on en.wp's VP/T the other day after the HTTPS deploy.

https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=571756338#Excessive_caching.3F
Comment 19 Derk-Jan Hartman 2013-09-06 09:39:10 UTC
Again, ClueBot reverts. Coincidence? It seems like we might be missing some cache-clear signals from the API entry points.
Comment 20 bianjiang 2013-09-13 01:41:18 UTC
Hi wiki-dev,

is there any update on this issue?

Thanks
Comment 21 Gerrit Notification Bot 2013-09-24 20:57:03 UTC
Change 85917 had a related patch set uploaded by Aaron Schulz:
Reduce chance for parser cache race conditions

https://gerrit.wikimedia.org/r/85917
Comment 22 Gerrit Notification Bot 2013-09-30 17:01:16 UTC
Change 85917 merged by jenkins-bot:
Reduce chance for parser cache race conditions

https://gerrit.wikimedia.org/r/85917
Comment 23 Andre Klapper 2013-10-18 18:37:34 UTC
(In reply to comment #22)
> Change 85917 merged by jenkins-bot:
> Reduce chance for parser cache race conditions

Aaron: Is more work needed, or can this be closed as RESOLVED FIXED?
Comment 24 Aaron Schulz 2013-10-18 21:17:29 UTC
Assuming this is fixed now after that change (aside from maybe a few older cached entries, which should all be gone within a month).
Comment 25 bianjiang 2013-11-19 06:10:26 UTC
Created attachment 13829 [details]
title=St._Louis_Cardinals, oldid=579097247

Old version of St. Louis Cardinals, right before we hit the stale cache:
http://en.wikipedia.org/w/index.php?title=St._Louis_Cardinals&oldid=579097247
Comment 26 bianjiang 2013-11-19 06:14:14 UTC
Created attachment 13830 [details]
stale cache of title=St._Louis_Cardinals, oldid=579097280

The new (i.e. stale) revision of the article that we got when crawling it:
http://en.wikipedia.org/w/index.php?title=St._Louis_Cardinals&oldid=579097280
Comment 27 bianjiang 2013-11-19 06:18:56 UTC
(I don't know how to reopen this bug)

We actually hit the stale cache again on Oct. 28. I've attached the stale version (rev=579097280).

Again, it seems the 2 revisions (579097247, 579097280) happened within a very short time window, and the HTML of the latter (579097280) was rendered using the old wikitext.

The strange thing is that part of the new HTML did change.
The old one looks like:
.. "wgRevisionId": 579097247..
<p>
…a gay butt sex team based…
</p>
…
<h2>Further reading</h2>
...

The new one (the stale one) looks like:
.. "wgRevisionId": 579097280..
<p>
…a gay butt sex team based…
</p>
…
<h2>Further reading</h2><span class="mw-editsection">...</span>
...
Comment 28 Andre Klapper 2013-11-19 16:04:05 UTC
I am reopening this based on last comments.

Aaron: Could you take a look at this again?
Any specific / more information that could be gathered / provided?
Comment 29 Aaron Schulz 2013-11-19 17:28:07 UTC
What instances were encountered after 11/18/13? I said there would be stale items for a month after I closed this (the maximum cache time for our proxies).
Comment 30 bianjiang 2013-11-19 19:11:25 UTC
The one we encountered (I've uploaded 2 attachments) happened at 12:41, 28 October 2013.
Comment 31 Andre Klapper 2013-11-20 09:04:55 UTC
Aaron: Sorry, didn't read comment 24 closely.

Closing this ticket again as there is no indication that there is still a problem. Please inform us if this problem still happens after November 18, 2013.
Comment 32 bianjiang 2013-12-11 04:01:18 UTC
Created attachment 14054 [details]
de.wikipedia.org/w/index.php?title=Albanien&oldid=125280145

a vandalized revision of Albanien on dewiki, at 2013-12-09 17:07:12
Comment 33 bianjiang 2013-12-11 04:02:58 UTC
Created attachment 14055 [details]
de.wikipedia.org/w/index.php?title=Albanien&oldid=125280167

a fix for "Albanien" on dewiki, at 2013-12-09 17:07:58
Comment 34 bianjiang 2013-12-11 04:05:20 UTC
@Andre,

We hit another stale case on Dec. 9. I've uploaded 2 snapshots of it. As you can see, the vandalized content "ist ein dummes Land mit Zigeunern" appears in both revisions.

Could you reopen the bug and assign it properly?

Thanks
Comment 35 Bawolff (Brian Wolff) 2013-12-11 05:33:05 UTC
(In reply to comment #32)
> Created attachment 14054 [details]
> de.wikipedia.org/w/index.php?title=Albanien&oldid=125280145
> 
> a vandalized revision of Albanien on dewiki, at 2013-12-09 17:07:12

I believe the original bug report was for having incorrect revision when viewing the current version, not when viewing oldids. Are you reporting that an old version showed the incorrect revision?
Comment 36 bianjiang 2013-12-11 06:31:28 UTC
I used "oldid" in the attachment names and descriptions just to make sure people can easily check those historical revisions.

The stale content appeared when we fetched rev:125280167, and we simply crawled it via "http://de.wikipedia.org/wiki/Albanien". No "oldid" was used.

So I think it's still the same bug.
Comment 37 bianjiang 2014-02-12 14:00:26 UTC
Created attachment 14572 [details]
stale cache of title= Roslindale, oldid=583299083

This html was fetched 100 minutes after the creation of the revision via
http://en.wikipedia.org/Roslindale 
(i.e. no oldid is used)

As you can see, the bad content ("COW ...") from the previous revision is still there.

So the content remains stale for 100 minutes.
Comment 38 anthonyzhang 2014-03-27 17:53:19 UTC
Created attachment 14944 [details]
stale cache of title=Memorial_to_the_Murdered_Jews_of_Europe, oldid=599805302

The source code of this HTML has revision id 599805302, but it has the HTML text "Holyhoax" from the previous revision (id is 599805297). We crawled this HTML about 2 hours after revision 599805302 was made.
Comment 39 Gerrit Notification Bot 2014-04-01 16:19:27 UTC
Change 122847 had a related patch set uploaded by Anomie:
Include parsed revision ID in parser cache

https://gerrit.wikimedia.org/r/122847
Comment 40 Gerrit Notification Bot 2014-04-01 16:49:59 UTC
Change 122847 merged by jenkins-bot:
Include parsed revision ID in parser cache

https://gerrit.wikimedia.org/r/122847
Comment 41 Brad Jorsch 2014-04-01 19:28:55 UTC
Please reopen if this bug occurs again after 1.23wmf21 is deployed, see https://www.mediawiki.org/wiki/MediaWiki_1.23/Roadmap for the schedule (short version: after April 10).
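
[Editor's sketch] Going only by the title of change 122847 ("Include parsed revision ID in parser cache"), the idea could look roughly like the following. This is an illustrative model, not the actual patch. The `ParserCache` class here and its method names are hypothetical, and it assumes that a cached entry is rejected when the revision it was parsed from differs from the page's current revision.

```python
class ParserCache:
    """Toy cache that remembers which revision each entry was parsed from."""

    def __init__(self):
        self._entries = {}

    def save(self, page, parsed_rev_id, html):
        # Store the revision ID alongside the rendered HTML.
        self._entries[page] = (parsed_rev_id, html)

    def get(self, page, current_rev_id):
        """Return cached HTML only if it was parsed from current_rev_id."""
        entry = self._entries.get(page)
        if entry is None:
            return None
        parsed_rev_id, html = entry
        if parsed_rev_id != current_rev_id:
            return None   # treat as a miss so the caller re-parses the page
        return html


cache = ParserCache()
cache.save("Netherlands", 543458897, "<p>vandalized text</p>")

# The page has moved on to a newer revision: the old entry is not reused.
print(cache.get("Netherlands", 543458973))
# A request for the revision that was actually parsed still hits the cache.
print(cache.get("Netherlands", 543458897))
```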
