Last modified: 2014-05-17 08:17:13 UTC
When deploying 1.18, we had many issues with this package-loading URL: http://bits.wikimedia.org/commons.wikimedia.org/load.php?debug=false&lang=en&modules=ext.gadget.mwEmbed%7Cext.uploadwizard.mediawiki.language.parser%7Cjquery.autoEllipsis%2CcheckboxShiftClick%2CcollapsibleTabs%2Ccookie%2CdelayedBind%2ChighlightText%2CmakeCollapsible%2CmessageBox%2CmwPrototypes%2Cplaceholder%2Csuggestions%2CtabIndex%7Cmediawiki.Uri%2Chtmlform%2Clanguage%2Cuser%2Cutil%7Cmediawiki.action.watch.ajax%7Cmediawiki.legacy.ajax%2Cmwsuggest%2Cwikibits%7Cmediawiki.page.ready&skin=vector&version=20111005T161514Z&*

Varnish would return a 503 Service Unavailable error after 5 seconds (plus a few hundredths of a second). A few other load.php URLs also returned 503, but I didn't make a note of them. I guessed that this was due to a timeout while RL was compressing, packaging, or caching this URL. We initially tried raising the first_byte_timeout of the Varnish servers to 10 seconds, which seemed to work for a few minutes, but then we started getting similar 503 errors after just over 10 seconds.

I tried calling some Apaches directly with the problematic URL (hoping that RL would then memcache the results), which may have worked. But then the JS messages didn't work; in their place we got the message key names in brackets, e.g. a button labelled "[mwe-upwiz-some-button]". This is what the message library does when it can't find the translated message. Some time later, while I was still investigating, the messages seemed to fix themselves. Then I went home.

There are probably a lot of confounding issues with the 1.18 rollout, and with the fact that we were running scap every now and then for other reasons. I am only speculating here, but if scap touches every file, that will cause RL to re-package JS unnecessarily, since it checks only the last-modified time, not the file contents.
But in any case, that shouldn't matter; the underlying issue is that rebuilding the packages can take an inordinately long time.
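The distinction between mtime-based and content-based cache invalidation described above can be sketched as follows. This is a minimal illustration, not ResourceLoader's actual code; the two helper functions are hypothetical. The point is that a plain touch (as a scap that rewrites every file would cause) changes an mtime-based key and forces a rebuild, while a content-hash key only changes when the bytes change.

```python
import hashlib
import os

def mtime_key(paths):
    # Invalidation key based on modification times: touching a file
    # (without changing its contents) changes the key and forces a
    # re-package of the module.
    return max(os.path.getmtime(p) for p in paths)

def content_key(paths):
    # Invalidation key based on file contents: touching a file without
    # changing its bytes leaves the key, and thus the cache entry, intact.
    h = hashlib.sha1()
    for p in sorted(paths):
        with open(p, "rb") as f:
            h.update(f.read())
    return h.hexdigest()
```

Content hashing costs a read of every file per check, which is why an mtime check is the cheap default; the trade-off only bites when deployment tooling rewrites timestamps wholesale.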
Wasn't this fixed by r99010?
(In reply to comment #1)
> Wasn't this fixed by r99010?

I don't even know what the issue was in production. Maybe it is fixed by r99010.
As mentioned on private-l, I have two plans to improve RL performance a little bit:

* Defragment the minification/transformation caches in memcached by caching each module separately rather than caching full responses
* Fix the RL registration performance issues Domas has been talking to me about, which cause a small performance hit in the MediaWiki initialization phase (i.e. during every non-Squid-cached request)

However, I maintain that we haven't actually seen "real" slowness. Yesterday's slowness was caused by DB slowness, which was in turn caused by RL overloading the DB with TRUNCATE queries. The slowness in May was caused by a flaw in the cache freshness logic combined with a bug that caused i18n recaching to happen for every language rather than just for the requested language, meaning i18n recaching (which happened on every request due to the broken cache freshness check) was 278 times as slow. "Things break if they suddenly get 100+ times slower due to a bug" is not a bug in itself.

That said, performance improvements are always good, so I'll work on getting those two improvements in.
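The first plan, caching each module's minified output separately instead of caching one blob per full response, could look roughly like the sketch below. This is an illustration under assumptions, not ResourceLoader's implementation: the `minify` function is a trivial placeholder and a dict stands in for memcached. The gain is that a request for a new combination of modules reuses the per-module entries it has in common with earlier requests, rather than recomputing and storing another near-duplicate full response.

```python
cache = {}  # stands in for memcached

def minify(source):
    # Placeholder for a real JS minifier: here it just drops blank lines.
    return "\n".join(line for line in source.splitlines() if line.strip())

def build_response(modules):
    # Assemble a load.php-style combined response from per-module
    # cache entries, minifying only the modules not yet cached.
    parts = []
    for name, source in modules.items():
        key = ("minify", name)
        if key not in cache:
            cache[key] = minify(source)
        parts.append(cache[key])
    return "\n".join(parts)
```

With full-response caching, every distinct `modules=` combination in the URL is a separate cache entry containing mostly the same minified code; per-module caching defragments that into one entry per module.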
(In reply to comment #3)
> * Defragment the minification/transformation caches in memcached by caching
> each module separately rather than caching full responses
> * Fix the RL registration performance issues Domas has been talking to me
> about, which cause a small performance hit in the MediaWiki initialization
> phase (i.e. during every non-Squid-cached request)

What's the status on these two? I thought the former was solved in the meantime, right?
Assuming this is fixed.