Last modified: 2014-04-01 10:21:04 UTC
http://en.m.wikipedia.beta.wmflabs.org/ currently gives the following response: Request: GET http://en.wikipedia.beta.wmflabs.org/, from 127.0.0.1 via deployment-cache-mobile03 deployment-cache-mobile03 ([127.0.0.1]:3128), Varnish XID 723150036 Forwarded for: 216.38.130.164, 127.0.0.1 Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:09:57 GMT
Request: GET http://de.wikipedia.beta.wmflabs.org/wiki/, from 127.0.0.1 via deployment-cache-text02 deployment-cache-text02 ([127.0.0.1]:3128), Varnish XID 105897343 Forwarded for: 78.94.xxx.xxx, 127.0.0.1 Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:17:47 GMT
Request: GET http://bits.beta.wmflabs.org/images/wikimedia-button.png, from 78.94.153.111 via deployment-cache-bits01 deployment-cache-bits01 ([10.68.16.12]:80), Varnish XID 90403984 Forwarded for: 78.94.xxx.xxx Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:40:55 GMT
Pages requested while logged out (no cookies) are mostly served, or are hitting the cache(?), but bits doesn't work either. Sometimes the connection also times out.
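For triage it can help to see which cache host returned each 503. Below is a minimal sketch in Python that pulls the fields out of error text like the responses quoted above. The regular expression is an assumption modelled only on the samples in this report, not on any documented Varnish error format, and `parse_errors` is a hypothetical helper name.

```python
import re

# Field layout is an assumption inferred from the error pages quoted in this
# report; it is not a documented Varnish format.
ERROR_RE = re.compile(
    r"Request: (?P<method>\S+) (?P<url>\S+), "
    r"from (?P<client>\S+) via (?P<cache>\S+) .*?"
    r"Error: (?P<status>\d{3}), (?P<reason>.+?) at (?P<when>.+? GMT)"
)


def parse_errors(text):
    """Return one dict per error block found in text (hypothetical helper)."""
    return [m.groupdict() for m in ERROR_RE.finditer(text)]


# Two of the responses quoted in this report, joined on one line.
SAMPLE = (
    "Request: GET http://en.wikipedia.beta.wmflabs.org/, from 127.0.0.1 via "
    "deployment-cache-mobile03 deployment-cache-mobile03 ([127.0.0.1]:3128), "
    "Varnish XID 723150036 Forwarded for: 216.38.130.164, 127.0.0.1 "
    "Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:09:57 GMT "
    "Request: GET http://de.wikipedia.beta.wmflabs.org/wiki/, from 127.0.0.1 via "
    "deployment-cache-text02 deployment-cache-text02 ([127.0.0.1]:3128), "
    "Varnish XID 105897343 Forwarded for: 78.94.xxx.xxx, 127.0.0.1 "
    "Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:17:47 GMT"
)

if __name__ == "__main__":
    for err in parse_errors(SAMPLE):
        print(err["when"], err["cache"], err["status"], err["url"])
```

Grouping the parsed records by `cache` would show whether the failures are confined to one Varnish instance or spread across the mobile, text and bits caches, as they appear to be here.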
Change 122436 had a related patch set uploaded by Hashar: beta: lower # of procs on jobrunner https://gerrit.wikimedia.org/r/122436
The CirrusSearch update job kicked in and started parsing the whole of simplewiki, which is a bit large for the beta cluster. Because our jobrunner (deployment-jobrunner01) is configured like production (launching a lot of jobs), the jobs were starving the application servers by querying /w/api.php ... I lowered the number of job runners with https://gerrit.wikimedia.org/r/#/c/122436/ There might be some other issue.
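As a toy illustration of why fewer runner processes means less pressure on the application servers, here is a minimal Python sketch. It is not how the jobrunner is implemented: `peak_concurrency` and `fake_job` are hypothetical names, threads stand in for runner processes, and a short sleep stands in for a request to /w/api.php.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(jobs, workers, job_seconds=0.01):
    """Run `jobs` fake jobs on `workers` threads and return the highest
    number of simultaneous fake API requests observed."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_job(_):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(job_seconds)  # stands in for a request to /w/api.php
        with lock:
            state["active"] -= 1

    # The pool size caps how many fake requests can be in flight at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_job, range(jobs)))
    return state["peak"]


if __name__ == "__main__":
    print("peak with 8 workers:", peak_concurrency(jobs=40, workers=8))
    print("peak with 2 workers:", peak_concurrency(jobs=40, workers=2))
```

The peak number of in-flight requests can never exceed the worker count, so lowering it puts a hard ceiling on the load one runner instance can generate, at the cost of the queue draining more slowly.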
I tried restarting both apaches, without much success. Eventually killed the parsoid daemon which was spamming the application server as well.
The root cause is definitely parsoid doing a lot of queries on the Api service.
So Parsoid was attempting to parse all of simplewiki. I have stopped the daemon and restarted it. Monitoring /var/log/parsoid/parsoid.log, it is all quiet on that front now, so the API application servers are no longer being hammered.
Also bits might be fully loaded by now.
I think the issue is solved now. The root cause was Parsoid attempting to fetch a bunch of page info from the API server for some reason. Restarting Parsoid apparently stopped the spam.
Would that make https://gerrit.wikimedia.org/r/#/c/122436/ obsolete or not?
(In reply to Daniel Zahn from comment #9)
> would that make https://gerrit.wikimedia.org/r/#/c/122436/ obsolete or not
That one lowers the number of jobs run in parallel on the jobrunner01 instance. It is unrelated but still a good thing to have, since the instance is less powerful than our prod servers.
Change 122436 merged by Alexandros Kosiaris: beta: lower # of procs on jobrunner https://gerrit.wikimedia.org/r/122436