Last modified: 2011-08-18 18:25:57 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T32428, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 30428 - Commons Main Page availability issue
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Unprioritized critical
Target Milestone: ---
Assigned To: Nobody
Depends on: 30431
Blocks:
Reported: 2011-08-17 20:17 UTC by Asher Feldman
Modified: 2011-08-18 18:25 UTC (History)

Description Asher Feldman 2011-08-17 20:17:57 UTC
The commons main page has been periodically unavailable this morning due to the poolqueue for this page filling up.

- http://commons.wikimedia.org/wiki/Main_Page isn't parser cacheable.  With debug logging enabled, "Parser output marked as uncacheable" is logged, which comes from Parser::disableCache. That seems to only be called from one place which requires ( $title->getNamespace() == NS_SPECIAL && $this->mOptions->getAllowSpecialInclusion() && $this->ot['html'] ) to be true.  I don't see anything Main_Page related in the special namespace, so not sure what's going on there? It also results in "don't cache" headers for squid.

- The poolcounter makes a lot of sense for hot / rapidly changing pages that can be parser cached.  One apache gets the lock, all others queue up behind it, or after 50, return an immediate error.  For a popular page that can't be parser cached, it really sucks.  All requests are serialized and stack up, resulting in very slow page load times, or immediate errors.

- Pages like this are insanely easy to DOS - either deliberately with minimal effort or just due to natural traffic spikes. 

- Main Pages should probably all be parser cacheable and/or we should disable use of the poolqueue on pages that aren't. It currently seems like this isn't determined until after parsing however.
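
A minimal sketch of the queueing behaviour described in the second bullet above, written as a toy Python model rather than PoolCounter's actual code. The function name, the one-second parse cost, and the reading of "after 50" as a limit of 50 waiters behind the lock holder are all assumptions made for illustration.

```python
def simulate_burst(requests, cacheable, max_queue=50, parse_seconds=1.0):
    """Toy model of one burst of simultaneous requests for a single page.

    Assumptions (not taken from PoolCounter itself): the cache starts cold,
    one request holds the lock, up to max_queue requests wait behind it,
    and any further requests get an immediate error.
    """
    if requests <= 0:
        return {"parses": 0, "rejected": 0, "worst_wait_seconds": 0.0}

    waiters = min(requests - 1, max_queue)
    rejected = requests - 1 - waiters

    if cacheable:
        # The lock holder parses once; every waiter then reads the cached result.
        parses = 1
    else:
        # Nothing is stored, so each waiter has to take the lock and parse in turn.
        parses = 1 + waiters

    return {
        "parses": parses,
        "rejected": rejected,
        "worst_wait_seconds": parses * parse_seconds,
    }


if __name__ == "__main__":
    print(simulate_burst(100, cacheable=True))
    print(simulate_burst(100, cacheable=False))
```

With these defaults, a burst of 100 requests costs one parse when the page is cacheable and 51 serialized parses when it is not; 49 requests are rejected in both cases.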
Comment 1 Bawolff (Brian Wolff) 2011-08-17 20:25:39 UTC
Appears to be caused by including <categorytree>. Not sure why that's triggering the special page transclusion cache killing stuff.

I think we should still cache pages with transcluded special pages, just maybe for a limited time (like {{CURRENTDAY}} does) since they no longer vary with url params in trunk, but that's a side issue.
Comment 2 Bawolff (Brian Wolff) 2011-08-17 21:00:24 UTC
Note, the special page transclusion thing is unrelated. Several extensions call Parser::disableCache, including CategoryTree.

What's really surprising is that Wikimedia doesn't have $wgCategoryTreeDisableCache = false; set. That setting really should be set to false for larger sites.
Comment 3 Sam Reed (reedy) 2011-08-17 22:09:40 UTC
PoolCounter has been re-enabled

$wgCategoryTreeDisableCache has been set to false too
Comment 4 Asher Feldman 2011-08-17 22:30:41 UTC
$wgCategoryTreeDisableCache = false did the trick, the commons home page is now getting parser cached as well as cached by squid.  

This issue appears to have arisen due to a DoS attack generating ~5k reqs/sec that got lucky and happened upon this weak spot in our infrastructure.  Action has also been taken to block that traffic.

We should check extensions used in production for Parser::disableCache calls as this general issue could hit us again elsewhere.
Comment 5 Bawolff (Brian Wolff) 2011-08-17 22:40:01 UTC
>We should check extensions used in production for Parser::disableCache calls as
>this general issue could hit us again elsewhere.


grepping says the following extensions can disable cache in some circumstances (going through the ones that are in /branches/wmf/1.17wmf1/extensions):

*DonationInterface
*Quiz
*CommunityVoice
*ScanSet

Quiz is probably the only one that is really widely used. I'm unsure if anything uses CommunityVoice or ScanSet anymore, and DonationInterface is probably something that might be an exception. In core you can do stuff like {{special:recentchanges}}, which disables the cache but probably doesn't really need to (especially for things like {{special:prefixindex/foo}}).
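
The kind of scan described above could be sketched as follows. This is an assumed stand-in for the grep, not the command that was actually run: the function name and the regular expression are hypothetical, and it only matches method-style calls such as `$parser->disableCache()` in `.php` files.

```python
import os
import re

# Matches method calls such as $parser->disableCache() or $this->disableCache( ).
DISABLE_CACHE_CALL = re.compile(r"->\s*disableCache\s*\(")


def find_disable_cache_calls(root):
    """Return sorted (relative_path, line_number) pairs for matching lines in .php files."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".php"):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on files with odd encodings.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if DISABLE_CACHE_CALL.search(line):
                        hits.append((os.path.relpath(path, root), number))
    return sorted(hits)
```

Calling `find_disable_cache_calls` on an extensions checkout lists each file and line to review by hand; a match only shows that a call exists, not the circumstances under which it fires.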
Comment 6 Rob Lanphier 2011-08-18 16:37:30 UTC
Re-enabling the cache seems to not only have solved the problem, but (not too surprisingly) brought page load times down pretty substantially:
http://status.wikimedia.org/8777/163404/Wiki-commons-%28s4%29

The downside is that I imagine we're going to start getting complaints about the CategoryTree being out of date.  It seems as though completely disabling the cache is *very* rarely the right answer in production, and that setting a very short time-to-live on the parser cache (e.g. 5 minutes) will be good enough in 99% of cases.  Is setting parser cache TTL something that is as easy for extension authors to do as disabling the cache entirely?
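
To illustrate why a short time-to-live goes most of the way, here is a toy Python model that counts how many parses a stream of requests triggers for a given TTL. It is a sketch under stated assumptions, not MediaWiki's parser cache: the function name is hypothetical, a TTL of 0 stands in for a fully disabled cache, and concurrency is ignored.

```python
def count_parses(request_times, ttl_seconds):
    """Count parses needed to serve requests arriving at the given times (in seconds).

    A cached copy is reused until it is ttl_seconds old. With ttl_seconds == 0
    every request parses, which models disabling the cache entirely.
    """
    parses = 0
    expires_at = None  # no cached copy yet
    for t in sorted(request_times):
        if expires_at is None or t >= expires_at:
            parses += 1
            expires_at = t + ttl_seconds
    return parses


if __name__ == "__main__":
    one_per_second_for_ten_minutes = range(600)
    print(count_parses(one_per_second_for_ten_minutes, ttl_seconds=300))
    print(count_parses(one_per_second_for_ten_minutes, ttl_seconds=0))
```

For one request per second over ten minutes, the 5 minute TTL needs 2 parses while the disabled cache needs 600, at the cost of content being up to 5 minutes stale.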
Comment 7 Mark A. Hershberger 2011-08-18 18:25:57 UTC
Discussion of Comment #6 branched to Bug #30448 since the issue in this bug is resolved.
