Last modified: 2014-07-14 13:50:11 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T69849, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 67849 - If-Modified-Since handling is broken
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: High major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-07-11 11:44 UTC by bianjiang
Modified: 2014-07-14 13:50 UTC (History)

Description bianjiang 2014-07-11 11:44:50 UTC
We (the crawling team at Google) found that Wikipedia's If-Modified-Since handling is broken, at least when it comes to (Wikipedia-style) redirects/symlinks.

As a short-term workaround (in order to serve up-to-date content), we have temporarily stopped sending the If-Modified-Since header in crawl requests.

We hope you can take a look at this issue and let us know when it is resolved.



Below is what we did to reproduce the issue. (We have in fact observed many times before that we could not fetch the latest content of an article.)

1. Pick a Wikipedia page, vandalize it (apologies for being a vandal; we promise it is for the greater good...), and wait for it to be reverted.
http://en.wikipedia.org/wiki/SSh (note the capitalization). Its history (http://en.wikipedia.org/w/index.php?title=SSh&action=history) contains the vandalism at 01:47:42 (GMT) and its rollback at 01:49:06 (GMT).

2. Fetch the URL (at 01:47:54 GMT, right after the vandalized revision was submitted) using telnet.

It seems the "Last-Modified" value is taken from the latest revision of the final redirect destination: 08 Jul 2014 11:49:08 GMT (http://en.wikipedia.org/w/index.php?title=SSL&action=history)

$ telnet en.wikipedia.org 80
Trying 2620:0:861:ed1a::1...
Connected to text-lb.eqiad.wikimedia.org.
Escape character is '^]'.
GET /wiki/SSh HTTP/1.1
Host: en.wikipedia.org

HTTP/1.1 200 OK
Server: Apache
X-Content-Type-Options: nosniff
Content-language: en
X-UA-Compatible: IE=Edge
Vary: Accept-Encoding,Cookie
Last-Modified: Tue, 08 Jul 2014 11:49:08 GMT
Content-Type: text/html; charset=UTF-8
X-Varnish: 4282068392, 1036646122
Via: 1.1 varnish, 1.1 varnish
Transfer-Encoding: chunked
Date: Thu, 10 Jul 2014 01:47:54 GMT ← This one's accurate; this is when the crawl happened.
Age: 0
Connection: keep-alive
X-Cache: cp1065 miss (0), cp1053 frontend miss (0)
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Set-Cookie: GeoIP=::::v6; Path=/; Domain=.wikipedia.org

…Page for SSL follows…


3. After it's reverted, re-crawl the article (at 01:50:57 GMT) with an If-Modified-Since header set to the Last-Modified value from step 2 (we are simulating how production crawling works).
It's strange that we got a 304 response this time. And the "Last-Modified" value is weird too; neither SSh nor SSH has an update around 08 Jul 2014 11:49:08 GMT.

$ telnet en.wikipedia.org 80
Trying 2620:0:861:ed1a::1...                                                    
Connected to text-lb.eqiad.wikimedia.org.  
Escape character is '^]'.
GET /wiki/SSh HTTP/1.1                                                          
Host: en.wikipedia.org  
If-Modified-Since: Tue, 08 Jul 2014 11:49:08 GMT
                                                                                
HTTP/1.1 304 Not Modified
Server: Apache
X-Content-Type-Options: nosniff
Content-language: en
X-UA-Compatible: IE=Edge
Vary: Accept-Encoding,Cookie
Last-Modified: Thu, 26 Jun 2014 11:22:21 GMT ← A mysterious timestamp.
Content-Type: text/html; charset=UTF-8
X-Varnish: 4282315466 4282211494, 2309645493
Via: 1.1 varnish, 1.1 varnish
Date: Thu, 10 Jul 2014 01:50:57 GMT
Age: 74  
Connection: keep-alive
X-Cache: cp1065 hit (1), cp1066 frontend miss (0)
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Set-Cookie: GeoIP=::::v6; Path=/; Domain=.wikipedia.org

Connection closed by foreign host.
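
The two telnet sessions above can be approximated with a short script. The following is a minimal Python sketch, not the crawler's actual code: the function names `conditional_get` and `looks_stale` are made up for illustration, and the request goes over plain HTTP on port 80 only to mirror the transcripts (a current server may answer with a redirect instead).

```python
import http.client
from email.utils import parsedate_to_datetime
from typing import Optional, Tuple


def conditional_get(host: str, path: str,
                    if_modified_since: Optional[str] = None) -> Tuple[int, Optional[str]]:
    """Send one GET, optionally with an If-Modified-Since header.

    Returns (status code, Last-Modified header or None).
    Calling this performs a real network request.
    """
    headers = {}
    if if_modified_since is not None:
        headers["If-Modified-Since"] = if_modified_since
    conn = http.client.HTTPConnection(host, 80, timeout=10)
    try:
        conn.request("GET", path, headers=headers)
        resp = conn.getresponse()
        resp.read()  # drain the body before closing the connection
        return resp.status, resp.getheader("Last-Modified")
    finally:
        conn.close()


def looks_stale(status: int, if_modified_since: str, last_known_edit: str) -> bool:
    """True if the server answered 304 although an edit is known to have
    happened after the If-Modified-Since date (the symptom reported here)."""
    ims = parsedate_to_datetime(if_modified_since)
    edit = parsedate_to_datetime(last_known_edit)
    return status == 304 and edit > ims
```

With the values from steps 1 and 3, `looks_stale(304, "Tue, 08 Jul 2014 11:49:08 GMT", "Thu, 10 Jul 2014 01:49:06 GMT")` returns `True`, because the rollback of SSh happened after the date sent in If-Modified-Since.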
Comment 1 Andre Klapper 2014-07-11 12:24:16 UTC
Thanks for reporting this! 
Wondering who could investigate here. Greg?
Comment 2 Brad Jorsch 2014-07-11 14:14:02 UTC
The cause of the problem is simple enough: when MediaWiki follows a redirect, it uses the page_touched timestamp of the target page.[1]

A simple change to fix this bug would be to adjust Article::view() to use the maximum of the target page's page_touched and the redirect's page_touched when calling OutputPage::checkLastModified().


 [1]: @bianjiang: MediaWiki stores a "touched" timestamp for each page, which is updated when the page is edited, when a template transcluded on the page is edited, and in other cases too. The current page_touched timestamp for the SSH article is 2014-06-26T11:22:21Z,[2] which matches what you're seeing.
 [2]: https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&prop=info&titles=SSH
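
To make the proposed change concrete, here is a simplified Python model of the decision. It is not MediaWiki's PHP code: the helper names are hypothetical, the real OutputPage::checkLastModified() may take other inputs into account, and the redirect's page_touched value used below is an assumption (the rollback time from the description), since this report does not state it.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_touched(ts: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as '2014-06-26T11:22:21Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def last_modified(target_touched: datetime,
                  redirect_touched: Optional[datetime] = None) -> datetime:
    """Timestamp to compare against If-Modified-Since for a page view.

    With redirect_touched=None this models the reported behavior (target only);
    otherwise it models the proposed fix (the later of the two timestamps).
    """
    if redirect_touched is None:
        return target_touched
    return max(target_touched, redirect_touched)


def status_for(if_modified_since: datetime, modified: datetime) -> int:
    """304 if nothing changed after the client's timestamp, else 200."""
    return 304 if modified <= if_modified_since else 200


ims = parse_touched("2014-07-08T11:49:08Z")          # If-Modified-Since sent in step 3
ssh_touched = parse_touched("2014-06-26T11:22:21Z")  # page_touched of SSH, from footnote [1]
# ASSUMPTION: the SSh redirect was last touched at its rollback time.
ssh_redirect_touched = parse_touched("2014-07-10T01:49:06Z")

before_fix = status_for(ims, last_modified(ssh_touched))
after_fix = status_for(ims, last_modified(ssh_touched, ssh_redirect_touched))
print(before_fix, after_fix)  # 304 200
```

Under these assumptions, using only the target's timestamp reproduces the 304 from the report, while taking the maximum yields a 200 with fresh content.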
Comment 3 Gerrit Notification Bot 2014-07-11 14:15:04 UTC
Change 145568 had a related patch set uploaded by Anomie:
Take redirect modifications into account for If-Modified-Since

https://gerrit.wikimedia.org/r/145568
Comment 4 Gerrit Notification Bot 2014-07-11 20:48:31 UTC
Change 145568 merged by jenkins-bot:
Take redirect modifications into account for If-Modified-Since

https://gerrit.wikimedia.org/r/145568
Comment 5 Brad Jorsch 2014-07-11 21:05:06 UTC
@bianjiang: You should be able to test this change in a few minutes on Beta Labs, see http://deployment.wikimedia.beta.wmflabs.org/wiki/Main_Page for details on that site.

If nothing else happens, this change will go out to the non-Wikipedia production wikis on July 22 and to the Wikipedias on July 24. We could push it out on Monday, though, especially if testing on Beta Labs over the weekend indicates that all's well.
Comment 6 bianjiang 2014-07-14 04:51:07 UTC
@Brad, it seems I cannot test it on deployment.wikimedia.beta.wmflabs.org, because I have no way to "revert" a revision. (I just registered a new account there, bianjiang, but it has no privilege to revert; undo always seems to generate a new revision instead of "reverting" to an old one.)
Comment 7 Brad Jorsch 2014-07-14 13:50:11 UTC
(In reply to bianjiang from comment #6)
> undo seems always generate a new revision instead of
> "reverting" to an old one.

That's how undo and other methods of reverting work in MediaWiki. The only way to actually "revert" as you're thinking is to delete the page entirely and then restore all but the most recent revision, which is not particularly encouraged.
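
As a rough illustration of the point above, the sketch below models a page history as an append-only list. It is not MediaWiki code: the `Revision` type, the `undo_last` helper, and the revision texts are all invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Revision:
    rev_id: int
    text: str


def undo_last(history: List[Revision]) -> List[Revision]:
    """Undo the newest revision by appending a copy of the one before it.

    Nothing is removed: the returned history is one entry longer.
    """
    if len(history) < 2:
        raise ValueError("need at least two revisions to undo")
    restored = history[-2]
    new_rev = Revision(rev_id=history[-1].rev_id + 1, text=restored.text)
    return history + [new_rev]


# Hypothetical history: an original redirect followed by a vandalized revision.
history = [Revision(1, "#REDIRECT [[A]]"), Revision(2, "vandalized")]
history = undo_last(history)
print(len(history), history[-1].text == history[0].text)  # 3 True
```

In this model the newest revision after an undo is newer than the vandalized one, so a client would expect the page's modification time to move forward rather than back.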
