Last modified: 2014-04-01 10:21:04 UTC

This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T65315, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 63315 - beta.wmflabs.org unreachable (503 error) after migration to eqiad
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta)
Version: unspecified
Hardware: All All
Importance: Unprioritized blocker
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
 
Reported: 2014-03-31 18:20 UTC by Ryan Kaldari
Modified: 2014-04-01 10:21 UTC (History)

Description Ryan Kaldari 2014-03-31 18:20:18 UTC
http://en.m.wikipedia.beta.wmflabs.org/ currently gives the following response:

Request: GET http://en.wikipedia.beta.wmflabs.org/, from 127.0.0.1 via deployment-cache-mobile03 deployment-cache-mobile03 ([127.0.0.1]:3128), Varnish XID 723150036
Forwarded for: 216.38.130.164, 127.0.0.1
Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:09:57 GMT
Comment 1 se4598 2014-03-31 18:45:41 UTC
Request: GET http://de.wikipedia.beta.wmflabs.org/wiki/, from 127.0.0.1 via deployment-cache-text02 deployment-cache-text02 ([127.0.0.1]:3128), Varnish XID 105897343
Forwarded for: 78.94.xxx.xxx, 127.0.0.1
Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:17:47 GMT

Request: GET http://bits.beta.wmflabs.org/images/wikimedia-button.png, from 78.94.153.111 via deployment-cache-bits01 deployment-cache-bits01 ([10.68.16.12]:80), Varnish XID 90403984
Forwarded for: 78.94.xxx.xxx
Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:40:55 GMT


Pages viewed while logged out (no cookies) are mostly served, or are perhaps hitting the cache(?), but bits doesn't work either.
Sometimes the connection also times out.
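
For triaging error pages like the ones pasted above, the following is a minimal sketch (not part of the original report) that extracts the cache host, status code, and timestamp from the error text. The regular expressions assume the exact layout seen in this report, and `parse_varnish_error` is a hypothetical helper name.

```python
import re

# Patterns matching the error text quoted in this report (assumed layout).
REQUEST_RE = re.compile(
    r"Request: (?P<method>\S+) (?P<url>\S+), from (?P<client>\S+) "
    r"via (?P<cache>\S+) .*Varnish XID (?P<xid>\d+)"
)
ERROR_RE = re.compile(
    r"Error: (?P<status>\d{3}), (?P<reason>.+?) at (?P<when>.+)$"
)


def parse_varnish_error(text):
    """Return a dict of the fields found in one pasted error block."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        match = REQUEST_RE.match(line) or ERROR_RE.match(line)
        if match:
            fields.update(match.groupdict())
    if "status" in fields:
        fields["status"] = int(fields["status"])
    return fields


if __name__ == "__main__":
    sample = (
        "Request: GET http://bits.beta.wmflabs.org/images/wikimedia-button.png, "
        "from 78.94.153.111 via deployment-cache-bits01 deployment-cache-bits01 "
        "([10.68.16.12]:80), Varnish XID 90403984\n"
        "Forwarded for: 78.94.xxx.xxx\n"
        "Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:40:55 GMT"
    )
    print(parse_varnish_error(sample))
```

Grouping the parsed results by the `cache` field would show at a glance which cache instances (mobile, text, bits) are returning 503s.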
Comment 2 Gerrit Notification Bot 2014-03-31 19:43:15 UTC
Change 122436 had a related patch set uploaded by Hashar:
beta: lower # of procs on jobrunner

https://gerrit.wikimedia.org/r/122436
Comment 3 Antoine "hashar" Musso (WMF) 2014-03-31 19:44:50 UTC
The CirrusSearch update job kicked in and started parsing the whole simplewiki, which is a bit large for the beta cluster.  Because our jobrunner (deployment-jobrunner01) is configured like production (launching a lot of jobs), the jobs were starving the application servers by querying /w/api.php ...

I lowered the number of job runners with https://gerrit.wikimedia.org/r/#/c/122436/

There might be some other issue.
Comment 4 Antoine "hashar" Musso (WMF) 2014-03-31 19:59:13 UTC
I tried restarting both apaches, without much success.  Eventually killed the parsoid daemon which was spamming the application server as well.
Comment 5 Antoine "hashar" Musso (WMF) 2014-03-31 20:01:58 UTC
The root cause is definitely parsoid doing a lot of queries on the Api service.
Comment 6 Antoine "hashar" Musso (WMF) 2014-03-31 20:19:05 UTC
So Parsoid was attempting to parse all of simplewiki. I have stopped the daemon and restarted it.  Monitoring /var/log/parsoid/parsoid.log, it is all quiet on that front now, so the API application servers are no longer being hammered.

Also bits might be
Comment 7 Antoine "hashar" Musso (WMF) 2014-03-31 20:19:21 UTC
Also bits might be fully loaded by now.
Comment 8 Antoine "hashar" Musso (WMF) 2014-03-31 20:51:04 UTC
I think the issue is solved now. Root cause was Parsoid attempting to fetch a bunch of page info from the API server for some reason. Restarting Parsoid apparently stopped the spam.
Comment 9 Daniel Zahn 2014-03-31 21:20:37 UTC
would that make https://gerrit.wikimedia.org/r/#/c/122436/ obsolete or not
Comment 10 Antoine "hashar" Musso (WMF) 2014-03-31 21:53:05 UTC
(In reply to Daniel Zahn from comment #9)
> would that make https://gerrit.wikimedia.org/r/#/c/122436/ obsolete or not

That one lowers the number of jobs run in parallel on the jobrunner01 instance. Unrelated, but still a good thing to have: the instance is less powerful than our prod servers.
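
To illustrate why capping the number of parallel jobs bounds the load placed on the application servers, here is a minimal sketch. It is not the Gerrit change itself: `run_jobs`, its parameters, and the sleep standing in for an API request are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_jobs(job_count, max_workers, job_seconds=0.01):
    """Run job_count dummy jobs; return the peak number running at once."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def job(_):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(job_seconds)  # stands in for a request to /w/api.php
        with lock:
            running -= 1

    # The pool size is the cap on concurrent jobs, whatever the queue length.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(job, range(job_count)))
    return peak


if __name__ == "__main__":
    for workers in (8, 2):
        print(workers, "workers -> peak concurrency", run_jobs(20, workers))
```

However many jobs are queued, the number of simultaneous requests never exceeds the worker count, so a smaller pool suits a less powerful instance.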
Comment 11 Gerrit Notification Bot 2014-04-01 10:21:04 UTC
Change 122436 merged by Alexandros Kosiaris:
beta: lower # of procs on jobrunner

https://gerrit.wikimedia.org/r/122436
