Last modified: 2013-08-14 22:04:56 UTC
Getting 503 errors when I try to log in on mobile or desktop on betalabs. I checked the logs and I don't see any errors related to the mobile frontend, but I do see errors in fatal.log related to MWScript.php and MWMultiVersion.php.
Yesterday I tried restarting varnish on deployment-cache-text1 and restarting apache on deployment-apache2 with no effect.
Today I restarted memcached on deployment-memc0 and -memc1, with no effect. Nik and I continue to experiment, and we've also asked Ryan Lane.
503 means the backend service is not reachable. I made a few connection tests and they never reach the Apache backends, so there must be something weird happening at the Varnish level (deployment-cache-text1 instance).

I did some simple refreshes in my browser against a random page ( http://en.wikipedia.beta.wmflabs.org/wiki/Dido_Sotiriou ). That gave me the 503 timeout in the browser, though I had no issue getting the page served via curl (which sends only a minimal set of headers).

Using curl I get the page served by the Varnish text frontend. X-Cache from three requests:

    X-Cache: deployment-cache-text1 hit (4), deployment-cache-text1 frontend hit (39)
    X-Cache: deployment-cache-text1 hit (4), deployment-cache-text1 frontend hit (40)
    X-Cache: deployment-cache-text1 hit (4), deployment-cache-text1 frontend hit (41)

Using a browser I get:

    X-Cache:deployment-cache-text1 miss (0), deployment-cache-text1 frontend miss (0)

Some header(s) sent by the browser cause the request to not be cacheable. That in turn overloads the Apache backends, which take a long time to serve the request, which might lead Varnish to serve a 503 whenever the timeout has been reached.
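For anyone comparing X-Cache values like the ones above, here is a minimal Python sketch that splits the header into its layers and reports whether any layer answered from cache. The function names (`parse_x_cache`, `served_from_cache`) are made up for illustration, and the pattern only assumes the `host [tier] hit|miss (count)` shape seen in this comment, not a general Varnish format.

```python
import re

# One layer of an X-Cache value, e.g. "deployment-cache-text1 frontend hit (39)".
# The optional middle word ("frontend") is kept as the layer's tier.
_LAYER = re.compile(
    r"^(?P<host>\S+)\s+(?:(?P<tier>\S+)\s+)?(?P<status>hit|miss)\s+\((?P<count>\d+)\)$"
)


def parse_x_cache(value):
    """Split an X-Cache header value into one dict per cache layer."""
    layers = []
    for part in value.split(","):
        match = _LAYER.match(part.strip())
        if match is None:
            raise ValueError("unrecognised X-Cache layer: %r" % part)
        layers.append({
            "host": match.group("host"),
            "tier": match.group("tier"),  # None when no tier word is present
            "status": match.group("status"),
            "count": int(match.group("count")),
        })
    return layers


def served_from_cache(value):
    """Return True if at least one layer reported a hit."""
    return any(layer["status"] == "hit" for layer in parse_x_cache(value))
```

With the values quoted above, the curl responses count as served from cache and the browser response does not.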
Setting priority/importance as this is blocking testing pretty badly.
So Varnish is not configured properly? Would some examples like these help? https://www.varnish-cache.org/trac/wiki/VCLExamples
I don't think the Apaches are overloaded. Typically Special pages don't get cached. On the live site Special:UserLogin is requested live every time from the Apaches. And actually when I request it via curl directly from the Apache in labs the response time is quite fast (1 second):

    curl -v -H 'Host: en.wikipedia.beta.wmflabs.org' 'http://deployment-apache32/w/index.php?title=Special:UserLogin' > out

    > GET /w/index.php?title=Special:UserLogin HTTP/1.1
    > User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
    > Accept: */*
    > Host: en.wikipedia.beta.wmflabs.org
    >
    0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0< HTTP/1.1 200 OK

Additionally, when I try curl from my laptop:

    curl -v 'http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin' > out

this times out, rather than giving me a hit:

    > GET /w/index.php?title=Special:UserLogin HTTP/1.1
    > User-Agent: curl/7.27.0
    > Host: en.wikipedia.beta.wmflabs.org
    > Accept: */*
    >
    0 0 0 0 0 0 0 0 --:--:-- 0:00:27 --:--:-- 0< HTTP/1.1 503 Service Unavailable
    ...
    < X-Cache: deployment-cache-text1 miss (0), deployment-cache-text1 frontend miss (0)

Still investigating.
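The direct-to-backend curl check above can be repeated across several app servers. The following is a rough Python sketch of that idea, not what was actually run here: `fetch_status` and `probe_backends` are hypothetical helper names, and the backend list is whatever hostnames you pass in. It sends the same kind of request with an explicit Host header and records the status and elapsed time for each backend.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, host, timeout):
    """Request url with an explicit Host header and return the HTTP status."""
    request = urllib.request.Request(url, headers={"Host": host})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx still mean the backend answered; report the code.
        return err.code


def probe_backends(backends, host, path, timeout=10.0, fetch=fetch_status):
    """Return {backend: (status or None, seconds)} for each backend.

    A status of None means no HTTP response arrived (connection refused,
    timeout, and so on).
    """
    results = {}
    for backend in backends:
        url = "http://%s%s" % (backend, path)
        start = time.monotonic()
        try:
            status = fetch(url, host, timeout)
        except OSError:
            status = None
        results[backend] = (status, time.monotonic() - start)
    return results
```

Called with the hostnames from this bug and `/w/index.php?title=Special:UserLogin` as the path, a backend that answers quickly with 200 next to one that returns None after the full timeout points at the app server rather than at Varnish.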
Turned out to be much simpler. Apache on deployment-apache33 was not really running (the parent process was alive but nothing else). Shot and restarted and now login works :-)
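Since a plain "is Apache running" check passes when only the parent process is alive, one way to catch this state is to count the parent's children. The sketch below is only an illustration: it parses the text output of `ps -eo pid,ppid,comm`, the function name `apache_worker_counts` is invented, and the process name `apache2` is an assumption about how the service appears on these instances.

```python
def apache_worker_counts(ps_output, name="apache2"):
    """Map each top-level `name` process to its number of `name` children.

    ps_output is the text printed by `ps -eo pid,ppid,comm`.
    A count of 0 means the parent is alive but has no workers.
    """
    procs = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        try:
            pid, ppid = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # header line or anything else non-numeric
        if fields[2].strip() == name:
            procs.append((pid, ppid))

    pids = {pid for pid, _ in procs}
    # A parent is a matching process whose own parent is not a matching process.
    parents = [pid for pid, ppid in procs if ppid not in pids]
    return {
        parent: sum(1 for _, ppid in procs if ppid == parent)
        for parent in parents
    }
```

The input can be captured with `subprocess.run(["ps", "-eo", "pid,ppid,comm"], capture_output=True, text=True).stdout`; any parent that maps to 0 is in the state described above.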
Now we seem to have no js/css though. I'll look for an issue on deployment-cache-bits03 but I'm not sure what to look for...
I am getting js and css both at login and on pages I view afterwards (that are slow enough that I'm pretty sure they are rendered and not cached).
OK, seems to be better now, thanks very much!
(In reply to comment #7)
> Turned out to be much simpler. Apache on deployment-apache33 was not really
> running (the parent process was alive but nothing else). Shot and restarted
> and now login works :-)

Ariel rocks. I thought about one of the app servers not responding, but since both had Apache running I did not investigate that much. We definitely need Icinga monitoring :)
I have filed bug 52867 to have the Apache service monitored.