Last modified: 2013-11-04 01:07:26 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T56647, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 54647 - Wikis show more and more pages in an old cached version instead of the current one
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: wmf-deployment
Hardware: All All
Importance: High major
Target Milestone: ---
Assigned To: Bryan Davis
Depends on: 55632
Reported: 2013-09-26 16:00 UTC by Umherirrender
Modified: 2013-11-04 01:07 UTC

Attachments
Traffic record of an example of the outdated version issue (86.98 KB, application/octet-stream)
2013-09-27 19:07 UTC, Volker Alexander

Description Umherirrender 2013-09-26 16:00:34 UTC
More and more users on dewiki report that they see an old version when visiting a page or its history.

Some time ago there was a big problem with purge requests not being processed by the cache servers. Is this problem back? I could not find the bug numbers at the moment.
Or is there another problem?

* https://de.wikipedia.org/wiki/Wikipedia:Technik/Werkstatt#Anzeige.2FVersionsgeschichte_f.C3.BCr_unangemeldete_Nutzer_hinkt_hinterher
* https://de.wikipedia.org/wiki/Wikipedia:Technik/Werkstatt#FlatOut_2
* https://de.wikipedia.org/wiki/Wikipedia:Fzw#Lemma:_Die_letzten_.C3.84nderungen_werden_nur_sichtbar.2C_wenn_ich_als_User_eingeloggt_bin.3F.21
* https://de.wikipedia.org/wiki/Wikipedia:Fzw#Spinnt_die_Software_oder_drehen_die_Hamster_durch.3F
* https://de.wikipedia.org/wiki/Wikipedia:Fzw#Spinnt_die_Software_oder_drehen_die_Hamster_durch.3F_Die_zweite
Comment 1 Nemo 2013-09-27 15:09:28 UTC
Just to make sure, this is not about redirects, is it? (We had an issue with redirects some time ago.)
Comment 2 Andre Klapper 2013-09-27 15:30:24 UTC
Am I right that this only happens to users who are not logged in?

According to comments on de.wp, [[WP:BYPASS]] in Firefox helps.
Comment 3 Greg Grossmeier 2013-09-27 16:04:43 UTC
cc'ing a couple more people (Bryan and Brad) to ask them to take a quick look at this one.
Comment 4 Brad Jorsch 2013-09-27 16:32:52 UTC
It seems we've been having a lot of caching issues lately, for example bug 54513.

What we really need to start with are captures of the HTTP request and response headers that are returning out-of-date content.
Comment 5 Bryan Davis 2013-09-27 16:46:04 UTC
10:40	paravoid	we had a multicast/htcp outage from the 22nd and until yesterday
10:40	paravoid	it was a network issue
Comment 6 Volker Alexander 2013-09-27 17:44:27 UTC
Hi,
"until yesterday": Yes, I notice something has changed. But it is not completely solved yet.
Let me give a summary of my experience:
- Users who are logged in see current versions, but users who are not logged in don't.
- The change mentioned above concerns the following: initially, users who were not logged in did not see the current version in the page history list either, but now they do.
- Finally, something that puzzles me: soft and hard reloads (like Ctrl + F5 in Firefox) don't help when using IE or Opera. The same is true even after entirely deleting the cache and profile paths (tested with Opera 12). Amazingly, only Firefox's hard reload (Ctrl + F5) had an effect, namely showing the desired current version of a page. And: after doing so, browsing that page with IE or Opera also showed the correct current version, regardless of whether the cache had been refreshed again or not.
Strange, isn't it? At least for me. Maybe one of you tech experts has an explanation.
My feeling is that the difference between Firefox's and Opera 12's hard reload methods and their effects could lead to an explanation. What do you think?
Kind regards from Berlin,
Volker Alexander
Comment 7 Volker Alexander 2013-09-27 19:04:26 UTC
I would like to add the following information:
- The problem doesn't occur with pages that are edited now (tested 2013-09-27, approx. 20:00 +02:00). It occurs with pages edited in the last few days.
- This might be an unnecessary detail: when I am logged out after having been logged in (i.e. with some cookies left, I assume), I see the current version. After erasing the cache I see the outdated one.
- BTW, I forgot to mention that I seem to remember we already had this or a quite similar problem. I know that is not very helpful. I just want to mention it in case anyone else remembers it better than I do.
- PerfektesChaos mentioned an HTTP snapshot. I'm not sure what that is, but I browsed "http://de.wikipedia.org/wiki/Convair_F2Y", which is one of the problematic pages, and recorded the traffic with Capsa 7 Free, Version 7.4.1, Build 2626. I hope this is helpful.
The weekend is starting! Have a nice one, everybody!
Volker
Comment 8 Volker Alexander 2013-09-27 19:07:19 UTC
Created attachment 13395
Traffic record of an example of the outdated version issue
Comment 9 Bryan Davis 2013-09-27 20:51:39 UTC
Looks like we need to send a bunch of purge messages to the cache layer to drop things that were affected by the htcp outage.

I'm working on figuring out how to determine what those things are. :) If someone already knows how to do this, they should go ahead, but I'll keep poking at it and try to make some progress.
Comment 10 Bryan Davis 2013-09-27 22:06:10 UTC
With Reedy's help we just sent purge messages for 34,447 pages on dewiki and 146,760 pages on enwiki. Hopefully this will make things a little better for anon users.
Comment 11 Bryan Davis 2013-09-27 22:35:46 UTC
Writeup of steps taken at https://www.mediawiki.org/wiki/User:BDavis_(WMF)/Notes/Finding_Files_To_Purge.

TL;DR:
- ran SQL against tools-lab slaves to find pages that were probably affected
- massaged data into full URLs
- Reedy sent the purges

The purges that were sent were only for namespace 0 (the main article namespace). There are doubtless many other pages that are still stale in Varnish. It was also only done for enwiki and dewiki. Other wikis are likely affected as well.
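The "massaged data into full URLs" step can be illustrated with a minimal sketch. It assumes the SQL step yielded plain page titles from the main namespace; the function name `page_url`, the default scheme, and the set of characters left unescaped are hypothetical choices for illustration, not what was actually run.

```python
from urllib.parse import quote


def page_url(host: str, title: str, scheme: str = "http") -> str:
    """Build a full article URL from a wiki host and a page title.

    Assumption: titles are main-namespace titles as stored in the database,
    so no namespace prefix needs to be added.
    """
    # MediaWiki URLs use underscores where titles contain spaces.
    path = quote(title.replace(" ", "_"), safe="/:()_,")
    return f"{scheme}://{host}/wiki/{path}"


# Example titles; "Convair F2Y" is one of the pages reported as stale.
titles = ["Convair F2Y", "FlatOut 2"]
urls = [page_url("de.wikipedia.org", t) for t in titles]
```

The resulting list could then be handed to whatever tool sends the purge messages.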
Comment 12 Bryan Davis 2013-09-27 23:23:50 UTC
Update: the data that I gave Reedy for purging earlier was malformed. He re-ran the purge with (hopefully) correct data around 2013-09-27T23:15Z.
Comment 13 Volker Alexander 2013-09-28 17:58:41 UTC
I would like to ask something:
How is it possible that Firefox's hard reload has a different effect on the wiki servers than that of other browsers?
An empty cache (I mean really deleted, with an entirely new profile) loads a page the same way as a hard reload, I assume. Just to clarify: I consider a soft reload to be a reload from the browser's cache and a hard reload to be a reload from the network. So how come the wiki servers deliver an old version to Opera and IE but a current one to Firefox?
Greetings,
Volker
Comment 14 Bawolff (Brian Wolff) 2013-09-28 18:06:29 UTC
(In reply to comment #13)
> I would like to ask something:
> How is it possible that Firefox' hard reload has another effect on wiki
> servers than the one of other browsers?
> An empty cache (I mean really deleted, with an entirely new profile) loads a
> page the same way like a hard reload, I assume. Just to clarify: I consider a
> soft reload as a reload from browser's cache and a hard reload as a reload
> from network. So how come that wiki servers are delivering an old version to
> Opera and IE but a current one to Firefox?
> Greetings,
> Volker

Were you previously logged in on Firefox (even if you weren't when you did the test)? Cookies from your login session (or a recent login session, or from a recent edit made while not logged in) will cause your requests to bypass the Varnish cache layer.

Otherwise it is possibly a coincidence (if you did the IE and Safari tests first and the Firefox test last, and the bug was fixed just before you started the Firefox test). However, that seems rather unlikely.

Other than those possibilities, I have no idea.
Comment 15 Brad Jorsch 2013-09-28 18:40:40 UTC
(In reply to comment #13)
> How is it possible that Firefox' hard reload has another effect on wiki
> servers than the one of other browsers?
> An empty cache (I mean really deleted, with an entirely new profile) loads a
> page the same way like a hard reload, I assume.

A normal page load is likely to send headers that allow intermediate caches to return a cached version of the page, while a hard reload will probably include headers that instruct intermediate caches not to serve a cached copy.

In both Firefox and Chromium here a normal page load doesn't send any "Cache-Control" or "Pragma" headers, a normal reload (F5) sends "Cache-Control: max-age=0" (but no "Pragma"), and a hard reload (Ctrl+F5) sends both "Cache-Control: no-cache" and "Pragma: no-cache". Opera 16 seems to do the same thing.

On the other hand, IE 9 and 10 and Opera 12 seem to send a "Cache-Control: no-cache" with no "Pragma" header on any reload (both F5 and Ctrl+F5). I don't know whether that's actually the reason, but it's certainly plausible that Wikimedia's caching layer was only revalidating in response to "Pragma: no-cache" and not "Cache-Control: no-cache".
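The observations above can be written down as a small lookup table, which makes it easy to see which combinations send "Pragma: no-cache" at all. This is only a sketch of what was reported in this comment: the family labels "modern" (Firefox, Chromium, Opera 16) and "legacy" (IE 9, IE 10, Opera 12) and the names `OBSERVED` and `sends_pragma_no_cache` are hypothetical, and normal page loads in the legacy browsers are left out because they were not reported.

```python
# Request headers per (browser family, action), as reported in this comment.
# "modern" = Firefox, Chromium, Opera 16; "legacy" = IE 9, IE 10, Opera 12.
OBSERVED = {
    ("modern", "load"): {},
    ("modern", "reload"): {"Cache-Control": "max-age=0"},
    ("modern", "hard-reload"): {"Cache-Control": "no-cache", "Pragma": "no-cache"},
    ("legacy", "reload"): {"Cache-Control": "no-cache"},
    ("legacy", "hard-reload"): {"Cache-Control": "no-cache"},
}


def sends_pragma_no_cache(family: str, action: str) -> bool:
    """Return True if the observed request carries 'Pragma: no-cache'."""
    return OBSERVED[(family, action)].get("Pragma") == "no-cache"


# Combinations that would refresh a cache reacting only to "Pragma: no-cache".
pragma_senders = [key for key in OBSERVED if sends_pragma_no_cache(*key)]
```

If the caching layer really reacted only to "Pragma: no-cache", `pragma_senders` would be the set of actions able to refresh a stale page, which would fit the report that only the Firefox hard reload helped.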
Comment 16 Bawolff (Brian Wolff) 2013-09-28 18:46:09 UTC
(In reply to comment #15)
> (In reply to comment #13)
> > How is it possible that Firefox' hard reload has another effect on wiki
> > servers than the one of other browsers?
> > An empty cache (I mean really deleted, with an entirely new profile) loads a
> > page the same way like a hard reload, I assume.
> 
> A normal page load is likely to send headers that allow intermediate caches to
> return a cached version of the page, while a hard reload will probably include
> headers that instruct intermediate caches not to cache.
> 
> In both Firefox and Chromium here a normal page load doesn't send any
> "Cache-Control" or "Pragma" headers, a normal reload (F5) sends
> "Cache-Control: max-age=0" (but no "Pragma"), and a hard reload (Ctrl+F5)
> sends both "Cache-Control: no-cache" and "Pragma: no-cache". Opera 16 seems to
> do the same thing.
> 
> On the other hand, IE 9 and 10 and Opera 12 seem to send a "Cache-Control:
> no-cache" with no "Pragma" header on any reload (both F5 and Ctrl+F5). I don't
> know whether that's actually the reason, but it's certainly plausible that
> Wikimedia's caching layer was only revalidating in response to "Pragma:
> no-cache" and not "Cache-Control: no-cache".

Interesting. OTOH wget -S --header 'cache-control: no-cache' --header 'pragma: no-cache' 'http://en.wikipedia.org/wiki/Main_Page' has headers that still suggest a cache hit.
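A rough Python equivalent of the wget check, for anyone who wants to script it. This is a sketch under assumptions: that the caches expose an "X-Cache" response header containing the word "hit" for cached responses (a common but non-standard convention), and that a positive "Age" header also suggests a cached copy. The function names are hypothetical.

```python
import urllib.request


def fetch_response_headers(url: str, request_headers: dict) -> dict:
    """Send a GET request with the given headers and return the response headers."""
    req = urllib.request.Request(url, headers=request_headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return dict(resp.headers.items())


def looks_like_cache_hit(response_headers: dict) -> bool:
    """Heuristic: does the response appear to come from an intermediate cache?"""
    lowered = {k.lower(): v for k, v in response_headers.items()}
    # Assumption: the cache layer reports hits in a non-standard X-Cache header.
    if "hit" in lowered.get("x-cache", "").lower():
        return True
    # The standard Age header is added by caches; a positive value suggests a stored copy.
    age = lowered.get("age", "0").strip()
    return age.isdigit() and int(age) > 0


# Usage (performs a network request, so it is left commented out):
# headers = fetch_response_headers(
#     "http://en.wikipedia.org/wiki/Main_Page",
#     {"Cache-Control": "no-cache", "Pragma": "no-cache"},
# )
# print(looks_like_cache_hit(headers))
```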
Comment 17 Volker Alexander 2013-09-28 19:25:42 UTC
@Brian in comment #14:
No, I wasn't logged in, and I hadn't been before either. I just deleted the cache and started Firefox again.
And yes, I agree with you. A coincidence seems unlikely, especially when we consider that this phenomenon had already occurred on the 25th at least.

@Brad in comment #15:
Thanks for the lesson. I didn't know that and hadn't thought in this direction either.

@all:
Would it be much effort to instruct the servers to deliver a cached version on the first request and always purge their cache on subsequent requests, regardless of the method and browser that generate those subsequent requests?
Comment 18 Platonides 2013-09-28 19:36:57 UTC
Volker, the servers don't know whether you reloaded the page or whether it's the first time you are visiting it.

Perhaps those HITs mean that Varnish checked with the Apaches and they answered that the page hadn't changed (and thus the stored page could be reused)?

Thanks for the long explanation, Brad (I would have had to dig up the behaviors). If Varnish is accepting "Pragma: no-cache" but not "Cache-Control: no-cache", that should be fixed: «HTTP/1.1 caches SHOULD treat "Pragma: no-cache" as if the client had sent "Cache-Control: no-cache".»
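The rule quoted from the HTTP/1.1 specification can be sketched as a small check a cache might apply to incoming request headers. This is an illustration of the quoted rule only, not Varnish's actual logic; the function name `wants_fresh_copy` is hypothetical.

```python
def wants_fresh_copy(request_headers: dict) -> bool:
    """Return True if the client asked caches not to answer from a stored copy.

    Treats "Pragma: no-cache" the same as "Cache-Control: no-cache",
    as the quoted HTTP/1.1 rule recommends.
    """
    # Header names and these directives are case-insensitive.
    lowered = {k.lower(): v.lower() for k, v in request_headers.items()}
    cache_control = [d.strip() for d in lowered.get("cache-control", "").split(",")]
    pragma = [d.strip() for d in lowered.get("pragma", "").split(",")]
    return "no-cache" in cache_control or "no-cache" in pragma
```

With a check like this, the IE 9/10 and Opera 12 reloads described earlier (Cache-Control only) and the Firefox hard reload (both headers) would be handled identically.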
Comment 19 Bawolff (Brian Wolff) 2013-09-28 19:46:20 UTC
If that were the case, I would expect that doing a hard refresh of a specific thumbnail in Firefox during one of the Varnish cache issues would result in the correct thumb being returned, which doesn't happen.
Comment 20 Platonides 2013-09-29 18:11:37 UTC
Bawolff? For refreshing thumbnails you need to purge the image page.
Comment 21 Brad Jorsch 2013-09-29 21:36:24 UTC
(In reply to comment #19)
> If that were the case, I would expect that doing hard refresh on firefox of a
> specific thumbnail, during one of the varnish cache issues would result in the
> correct thumb being returned, which doesn't happen.

I believe thumbnails are handled differently, in that the underlying file at the requested size is created if it doesn't already exist and is then served for subsequent requests (possibly with more caching in front). For articles, on the other hand, there isn't any "underlying file"; the page is regenerated (although not necessarily reparsed) every time a request for it makes it past Varnish/Squid.

I could be mistaken though.
Comment 22 Umherirrender 2013-09-30 06:43:44 UTC
Is it possible to set up monitoring for the HTCP purge requests?
It took some days for the outage to be noticed, which means many readers got old versions. That is bad.

What is needed is a cron job with a script that pings the cache servers from the web servers to check whether the network connection is still working, and sends mail or raises another kind of warning when it is not.
Or some other way to detect such an outage. Thanks.
Comment 23 Andre Klapper 2013-09-30 09:36:29 UTC
(In reply to comment #22)
> It is possible to install a monitoring for the htcp purge requests?

That was more or less bug 43449 - Brandon might be able to elaborate.
Comment 24 Bawolff (Brian Wolff) 2013-09-30 11:34:56 UTC
(In reply to comment #23)
> (In reply to comment #22)
> > It is possible to install a monitoring for the htcp purge requests?
> 
> That was more or less bug 43449 - Brandon might be able to elaborate.


There are graphs in ganglia (look at vhtcpd_packets_deqeued metric for the various varnish servers. It should be an increasing slope. During the outage it was a flat line). I've heard talk of also making the metric be a delta instead of total number of purges ever and making alarm bells go off for that metric instead of the current silent recording, but I don't know the status of either idea.
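The "flat line" check described here can be sketched as follows. It assumes the metric is a monotonically increasing counter sampled at regular intervals; the function names and the threshold of three consecutive flat intervals are hypothetical.

```python
def deltas(samples: list) -> list:
    """Turn a running total into per-interval increments."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def looks_stalled(samples: list, min_flat_intervals: int = 3) -> bool:
    """Return True if the counter has not moved over the most recent intervals."""
    recent = deltas(samples)[-min_flat_intervals:]
    # Require enough history before raising an alarm.
    return len(recent) == min_flat_intervals and all(d == 0 for d in recent)


# A healthy counter keeps growing; during an outage it flattens out.
healthy = [100, 140, 190, 260, 300]
outage = [100, 140, 190, 190, 190, 190]
```

An alerting job could call `looks_stalled` on the latest samples for each cache host and notify someone when it returns True, instead of relying on somebody looking at the graph.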
Comment 25 Bawolff (Brian Wolff) 2013-09-30 11:39:22 UTC
(In reply to comment #24)
> (In reply to comment #23)
> > (In reply to comment #22)
> > > It is possible to install a monitoring for the htcp purge requests?
> > 
> > That was more or less bug 43449 - Brandon might be able to elaborate.
> 
> 
> There are graphs in ganglia (look at vhtcpd_packets_deqeued metric for the
> various varnish servers. It should be an increasing slope. During the outage
> it was a flat line). I've heard talk of also making the metric be a delta
> instead of total number of purges ever and making alarm bells go off for that
> metric instead of the current silent recording, but I don't know the status of
> either idea.

Hmm looks like somebody did the delta thing (yay). Also I got the metric name wrong. I meant http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=vhtcpd_inpkts_dequeued&s=by+name&c=Upload+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
Comment 26 Gerrit Notification Bot 2013-10-01 17:40:56 UTC
Change 86883 had a related patch set uploaded by BryanDavis:
New maintenance script to purge edited pages.

https://gerrit.wikimedia.org/r/86883
Comment 27 Bryan Davis 2013-10-01 18:04:14 UTC
Tim suggested that a new maintenance script be developed to make recovering from this type of HTCP outage/network partition easier. https://gerrit.wikimedia.org/r/86883 is my attempt at creating said script.

My intent is that it could be used something like:

  mwscript purgeEditedFiles.php --starttime 20130922000000 \
    --endtime 20130926000000 \
    --htcp-dest $IP_OF_AMS_RELAY \
    --wiki $SOME_AFFECTED_WIKI
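The `--starttime` and `--endtime` values in the example are 14-digit timestamps in the YYYYMMDDHHMMSS form that MediaWiki uses. A minimal sketch for producing such values, assuming the script expects UTC (the helper name `to_mw_timestamp` is hypothetical):

```python
from datetime import datetime, timezone


def to_mw_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a 14-digit YYYYMMDDHHMMSS string in UTC."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


# The outage window used in the example invocation above.
start = to_mw_timestamp(datetime(2013, 9, 22, tzinfo=timezone.utc))
end = to_mw_timestamp(datetime(2013, 9, 26, tzinfo=timezone.utc))
```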
Comment 28 Nemo 2013-10-01 18:10:35 UTC
Thanks; broadening the subject, as de.wiki (along with en.wiki) is currently the only wiki where the problem was fixed, and the script should allow recovering from it on all other wikis too.
Comment 29 Gerrit Notification Bot 2013-10-08 00:17:55 UTC
Change 86883 merged by jenkins-bot:
Add script to purge changed pages from cache

https://gerrit.wikimedia.org/r/86883
Comment 30 Gerrit Notification Bot 2013-10-09 16:51:30 UTC
Change 88757 had a related patch set uploaded by BryanDavis:
Add script to purge changed pages from cache

https://gerrit.wikimedia.org/r/88757
Comment 31 Gerrit Notification Bot 2013-10-09 18:15:59 UTC
Change 88757 merged by jenkins-bot:
Add script to purge changed pages from cache

https://gerrit.wikimedia.org/r/88757
Comment 32 Bawolff (Brian Wolff) 2013-10-11 11:31:10 UTC
Hmm. At this point we are not that far away from the point in time where things will naturally fall out of cache (I believe there is a maxage of 30 days).
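The arithmetic behind this can be sketched as follows, taking the 30-day maxage as the assumption stated above and the outage window (22 to 26 September) from the earlier comments. The names are hypothetical.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)      # assumption: the maxage believed to apply
OUTAGE_START = date(2013, 9, 22)  # "from the 22nd"
OUTAGE_END = date(2013, 9, 26)    # "until yesterday", said on the 27th


def latest_expiry(cached_on: date, max_age: timedelta = MAX_AGE) -> date:
    """Latest date on which an object cached on `cached_on` can still be served."""
    return cached_on + max_age


first_expiry = latest_expiry(OUTAGE_START)
last_expiry = latest_expiry(OUTAGE_END)
days_left = (first_expiry - date(2013, 10, 11)).days
```

Under these assumptions, objects cached no later than the end of the outage would drop out of cache between 22 and 26 October 2013, a week and a half or so after this comment.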
Comment 33 Bryan Davis 2013-10-11 16:15:01 UTC
(In reply to comment #32)
> Hmm. At this point we are not that far away from the point in time where
> things will naturally fall out of cache (i believe there is a maxage of
> 30 days)

I have a deploy window reserved starting at 2013-10-16T20:00:00Z to purge additional pages. It has taken longer than I would have liked to get to this point, but on the positive side we are getting tools in place that will make dealing with this sort of event easier in the future.
Comment 34 Bryan Davis 2013-10-18 01:07:22 UTC
The job finished at 2013-10-17T02:43Z. All in all we sent HTCP purges
for a little over 1.6M pages in esams caches.
Comment 35 Volker Alexander 2013-11-03 22:50:32 UTC
Hi,
it just happened again. To be able to see the updated version in IE, I had to use Firefox and do a hard reload. Only then did IE show the recent version. It's like before, with emptied caches and so on.
Comment 36 Volker Alexander 2013-11-03 23:49:35 UTC
Here are two examples:
https://de.wikipedia.org/wiki/Massenmedien
and:
https://de.wikipedia.org/wiki/Massendrucksache
Edited as a logged-in user with Opera. Afterwards viewed with other browsers, namely Opera, IE and Firefox, while not logged in.
The problem is back, but I don't know the reason. What do you think?
Greetings,
Volker
Comment 37 jeremyb 2013-11-04 01:07:26 UTC
forked to a new bug: bug 56545

reclosing
