Last modified: 2014-11-08 15:51:51 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T74366, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 72366 - HHVM fcgi restarts during scap runs cause 503s (and failed tests)
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta)
Version: unspecified
Hardware/OS: All / All
Importance: High major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: browser-test-bug
Duplicates: 73010
Depends on:
Blocks:

Reported: 2014-10-22 17:38 UTC by Chris McMahon
Modified: 2014-11-08 15:51 UTC
CC: 12 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments: (none)
Description Chris McMahon 2014-10-22 17:38:29 UTC
I am seeing a number of 503s from beta labs but upon checking, everything looks normal.  Here is an example: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/136/testReport/junit/(root)/Notification%20types/The___message_is_not_a_keyword/

Possibly related to https://bugzilla.wikimedia.org/show_bug.cgi?id=72162
Comment 1 Chris McMahon 2014-10-22 17:42:31 UTC
adding Ori in case this might be a consequence of HHVM
Comment 2 Greg Grossmeier 2014-10-29 18:49:28 UTC
Can we have timing information to help diagnose?

See also: https://tools.wmflabs.org/nagf/?project=deployment-prep
Comment 4 Greg Grossmeier 2014-10-29 19:53:38 UTC
Reedy/Antoine: Can you two diagnose this one, please?

Chris: Having more than two data points would be helpful (there might be a pattern, e.g. tests running while code/db updates happen).
Comment 5 Chris McMahon 2014-10-29 19:57:30 UTC
Greg: it happens so often that I hadn't tried to track it closely.
Comment 6 Bryan Davis 2014-10-29 22:23:23 UTC
503s could very well be caused by https://gerrit.wikimedia.org/r/#/c/163078/, which restarts the hhvm server as part of each scap. While the hhvm fcgi process is shutting down and starting back up, Apache is likely to return a 503. I actually commented on this outcome in code review:

> In beta I'm seeing errors that line up with hhvm restarts:
> [proxy_fcgi:error] [pid 10116] [client 10.68.16.12:13867] AH01079: failed to make connection to backend: 127.0.0.1
> In beta we don't use pybal so there is no way at all to depool/repool around a 
> restart. In production I don't think that we have any automatable means for 
> depooling/repooling at this time. How worried should I be about a 503 storm if we put 
> this into production? Without depooling it is going to happen.
Comment 7 Greg Grossmeier 2014-10-29 22:29:09 UTC
That is a really strong culprit....
Comment 8 Ori Livneh 2014-10-29 22:30:35 UTC
> Greg: it happens so often that I hadn't tried to track it closely.

How often do we run scap?
Comment 9 Bryan Davis 2014-10-29 22:31:27 UTC
(In reply to Ori Livneh from comment #8)
> > Greg: it happens so often that I hadn't tried to track it closely.
> 
> How often do we run scap?

On every commit to core or an extension.
Comment 10 Chris McMahon 2014-10-29 22:41:43 UTC
163078 was merged 14 October, the timing looks good for this being the culprit.

It seems to occur with increasing frequency, but that could be my bias.
Comment 11 Bryan Davis 2014-10-29 22:45:30 UTC
(In reply to Chris McMahon from comment #10)
> 163078 was merged 14 October, the timing looks good for this being the
> culprit.
> 
> It seems to occur with increasing frequency, but that could be my bias.

If restarting hhvm is the cause, the frequency would depend on the level of gerrit +2 activity, the number of tests running, and the time needed for the hhvm fcgi process to restart. That is mostly impossible to calculate beforehand. Without a method to de-pool and drain an hhvm fcgi service before restarting it, it is quite likely that some requests will fail.
Comment 12 Greg Grossmeier 2014-10-29 22:45:56 UTC
(In reply to Bryan Davis from comment #9)
> (In reply to Ori Livneh from comment #8)
> > > Greg: it happens so often that I hadn't tried to track it closely.
> > 
> > How often do we run scap?
> 
> On every commit to core or an extension.

Pedantic: 
https://git.wikimedia.org/blob/integration%2Fconfig/HEAD/jjb%2Fbeta.yaml#L196

every 10 minutes we kick off a run
Comment 13 Greg Grossmeier 2014-10-29 22:47:08 UTC
(In reply to Bryan Davis from comment #11)
> (In reply to Chris McMahon from comment #10)
> > 163078 was merged 14 October, the timing looks good for this being the
> > culprit.
> > 
> > It seems to occur with increasing frequency, but that could be my bias.
> 
> If restarting hhvm is the cause, the frequency would depend on the level of
> gerrit +2 activity, the number of tests running, and the time needed for the
> hhvm fcgi process to restart. That is mostly impossible to calculate
> beforehand. Without a method to de-pool and drain an hhvm fcgi service
> before restarting it, it is quite likely that some requests will fail.

Is this (where "this" == "method to de-pool and drain an hhvm fcgi service before restarting") something that is planned to be added to scap for production? If so, when can we expect that to be in place?
Comment 14 Bryan Davis 2014-10-29 22:50:06 UTC
(In reply to Greg Grossmeier from comment #13)
> (In reply to Bryan Davis from comment #11)
> > 
> > If restarting hhvm is the cause, the frequency would depend on the level of
> > gerrit +2 activity, the number of tests running, and the time needed for the
> > hhvm fcgi process to restart. That is mostly impossible to calculate
> > beforehand. Without a method to de-pool and drain an hhvm fcgi service
> > before restarting it, it is quite likely that some requests will fail.
> 
> Is this (where "this" == "method to de-pool and drain an hhvm fcgi service
> before restarting") something that is planned to be added to scap for
> production? If so, when can we expect that to be in place?

Giuseppe is interested in adding etcd support to pybal that could make this possible but I don't think there is a timeline for it. Also beta does not use pybal so that is more work as well.

I'm a bit concerned about the possibility of 503 storms in prod as well.
Comment 15 Chris McMahon 2014-10-29 22:55:15 UTC
"a bit"?
Comment 16 Greg Grossmeier 2014-10-29 22:55:45 UTC
(In reply to Bryan Davis from comment #14)
> Giuseppe is interested in adding etcd support to pybal that could make this
> possible but I don't think there is a timeline for it. Also beta does not
> use pybal so that is more work as well.

/me nods

> I'm a bit concerned about the possibility of 503 storms in prod as well.

Ditto.

Also, I hate to say it, but if this is causing Chris and team (and other teams) a ton of pain (it seems it is, and it could potentially be the cause of all the "OMG! unstable Beta Cluster!!!!11" flak we're getting lately, though that's not clear yet)... I don't know what to do.
Comment 17 Ori Livneh 2014-10-29 23:16:51 UTC
I think someone should actually do the legwork to investigate this issue by cross-referencing 503s against the log files before we get carried away thinking about how we're going to solve the possible manifestation of this possible problem in production.
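
[Editor's note: a rough sketch of the cross-referencing suggested here, assuming a local Apache error log and a hypothetical file of scap restart timestamps. The paths and log formats are placeholders for illustration, not a tool that existed on the cluster.]

#!/usr/bin/env python
# Rough sketch: line up Apache proxy_fcgi errors (AH01079) with scap restart
# times. Both log paths and the restart-log format are assumptions.
import re
from datetime import datetime, timedelta

APACHE_LOG = '/var/log/apache2/error.log'   # assumed path
SCAP_LOG = '/var/log/scap-restarts.log'     # hypothetical: one ISO timestamp per restart
WINDOW = timedelta(seconds=30)              # how long after a restart we blame it

# e.g. "[Wed Oct 29 22:23:23.123456 2014] [proxy_fcgi:error] ... AH01079: ..."
APACHE_TS = re.compile(r'^\[(\w{3} \w{3} +\d+ [\d:]+)\.\d+ (\d{4})\]')

def apache_errors(path):
    with open(path) as f:
        for line in f:
            if 'AH01079' not in line:
                continue
            m = APACHE_TS.match(line)
            if m:
                yield datetime.strptime('%s %s' % m.groups(), '%a %b %d %H:%M:%S %Y')

def scap_restarts(path):
    with open(path) as f:
        return [datetime.strptime(l.strip(), '%Y-%m-%dT%H:%M:%S') for l in f if l.strip()]

if __name__ == '__main__':
    restarts = scap_restarts(SCAP_LOG)
    for err in apache_errors(APACHE_LOG):
        hits = [r for r in restarts if r <= err <= r + WINDOW]
        tag = 'during restart window' if hits else 'unexplained'
        print('%s  AH01079  %s' % (err.isoformat(), tag))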
Comment 18 Greg Grossmeier 2014-10-29 23:38:33 UTC
(In reply to Ori Livneh from comment #17)
> I think someone should actually do the legwork to investigate this issue by
> cross-referencing 503s against the log files before we get carried away
> thinking about how we're going to solve the possible manifestation of this
> possible problem in production.

From the above comment:

https://integration.wikimedia.org/ci/job/beta-scap-eqiad/27596/
corresponds to
https://saucelabs.com/jobs/eac7a841b8ff4b909419f7418a6b2194

Before I do more... that's a losing battle (clocks aren't synced between SauceLabs and Jenkins/Beta, browser tests aren't the way to tell if 503s are occurring, etc.). We need monitoring of the response codes for the Beta Cluster. If the fcgi restart is causing the monitoring to see 503s often, then we know that's the issue.

Can we agree that doing a scap every 10 minutes shouldn't cause 503s to randomly hit users and browser tests? We have browser tests that take longer than 10 minutes.
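
[Editor's note: a minimal sketch of the kind of response-code monitoring asked for here, assuming a hypothetical target URL and polling interval; not an existing tool.]

#!/usr/bin/env python
# Sketch: poll a Beta Cluster URL and log any non-200 responses with a
# timestamp, so 503s can be lined up with scap runs afterwards.
import time
import urllib2  # Python 2 stdlib, matching the tooling of the era
from datetime import datetime

URL = 'http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page'  # assumed target page
INTERVAL = 10  # seconds between checks

while True:
    try:
        code = urllib2.urlopen(URL, timeout=15).getcode()
    except urllib2.HTTPError as e:   # 4xx/5xx raise HTTPError; the code is on it
        code = e.code
    except Exception as e:           # DNS failures, timeouts, resets, ...
        code = 'ERR (%s)' % e
    if code != 200:
        print('%s %s %s' % (datetime.utcnow().isoformat(), URL, code))
    time.sleep(INTERVAL)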
Comment 19 Antoine "hashar" Musso (WMF) 2014-10-30 13:07:00 UTC
When the change "make scap restart HHVM" ( https://gerrit.wikimedia.org/r/#/c/163078/ ) has been deployed on the beta cluster, Bryan Davis immediately noticed the apache timeout issue. Quoting him:


> In beta I'm seeing errors that line up with hhvm restarts:
> 
> [proxy_fcgi:error] [pid 10116] [client 10.68.16.12:13867] AH01079:
> failed to make connection to backend: 127.0.0.1
> 
> In beta we don't use pybal so there is no way at all to depool/repool around
> a restart. In production I don't think that we have any automatable means for
> depooling/repooling at this time. How worried should I be about a 503 storm
> if we put this into production? Without depooling it is going to happen.

And I followed up with:

> A few random ideas:
>
> More proxy: Assuming HHVM restart is reasonably fast, could we set a timeout
> in Apache? That would mean adding mod_proxy in between Apache and hhvm
> fcgi, which will come with a bunch of potential failures on its own :/
>
> Add in pybal some RPC to remotely depool/repool a node
>
> Does hhvm support a graceful stop? i.e. stop accepting new connections (that
> would make pybal unpool the node) but keep processing ongoing requests?


Seems we need to make HHVM smarter, à la "apache graceful-stop".

On the beta cluster, load balancing is handled by the backend Varnishes, which have the web servers as backends. There might be a way to ask the Varnishes to stop sending new connections to a web server that is about to be restarted.


Finally, the patch was made to clear HHVM's JIT and PCRE caches; both might need a feature request filed with HHVM upstream so they can be reclaimed automatically or forced from time to time.
Comment 20 Bryan Davis 2014-10-30 15:45:49 UTC
When we talked with Emir from Facebook this spring he outlined how they handle restarts for each node:

* Load balancer polls /status.php on each server to determine if it should receive requests
* /status.php checks for a /hhvm/var/stopping file and if it exists returns a status to tell the LB not to send more requests

== Stop ==
* touch /hhvm/var/stopping to trigger depool response to LB
* wait up to 30s for requests to drain
** poll the built-in admin server to find current request load (/check-load)
** if >0: sleep 5s
* send stop signal to server via built-in admin server's /stop command
* send SIGTERM to hhvm processes via pkill
* poll /status.php once a second for up to 30s looking for a connection failure
* send SIGKILL to hhvm processes via pkill
* remove /hhvm/var/stopping trigger file

== Start ==
* remove /hhvm/var/stopping trigger file
* warm OS cache of hhvm binary and hhbc cache via `cat >/dev/null`
* start hhvm process
* poll admin server once per second for up to 10s to see if it started
** `exit 1` if not seen
* poll /status.php
** `exit 1` if you don't get a valid response

== Restart ==
* stop
* start

See also <https://github.com/emiraga/hhvm-deploy-ideas/blob/master/2restart/2restart.sh>
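
[Editor's note: a hedged Python sketch of the stop/start sequence outlined above. The trigger file, /check-load, /stop and /status.php come from the comment; the admin port, service name and cache paths are assumptions. This is illustrative, not what scap actually shipped.]

#!/usr/bin/env python
import os
import subprocess
import time
import urllib2  # Python 2 stdlib, matching the tooling of the era

STOPPING = '/hhvm/var/stopping'
ADMIN = 'http://127.0.0.1:9001'          # assumed HHVM admin server port
STATUS = 'http://127.0.0.1/status.php'   # LB health check

def stop():
    open(STOPPING, 'a').close()          # tell the LB to depool this node
    for _ in range(6):                   # wait up to ~30s for requests to drain
        load = int(urllib2.urlopen(ADMIN + '/check-load', timeout=5).read())
        if load == 0:
            break
        time.sleep(5)
    try:
        urllib2.urlopen(ADMIN + '/stop', timeout=5).read()   # ask HHVM to stop itself
    except Exception:
        pass                             # HHVM may drop the connection while exiting
    subprocess.call(['pkill', '-TERM', 'hhvm'])
    for _ in range(30):                  # poll up to 30s for a connection failure
        try:
            urllib2.urlopen(STATUS, timeout=1)
        except Exception:
            break                        # connection refused == server is down
        time.sleep(1)
    subprocess.call(['pkill', '-KILL', 'hhvm'])
    if os.path.exists(STOPPING):
        os.unlink(STOPPING)

def start():
    if os.path.exists(STOPPING):
        os.unlink(STOPPING)
    # warm the OS page cache for the binary and bytecode cache (paths assumed)
    subprocess.call('cat /usr/bin/hhvm /var/cache/hhvm/*.hhbc > /dev/null 2>&1',
                    shell=True)
    subprocess.call(['service', 'hhvm', 'start'])
    for _ in range(10):                  # up to 10s for the admin server to answer
        try:
            urllib2.urlopen(ADMIN + '/check-load', timeout=1)
            break
        except Exception:
            time.sleep(1)
    else:
        raise SystemExit(1)
    urllib2.urlopen(STATUS, timeout=5)   # final sanity check; raises on failure

def restart():
    stop()
    start()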
Comment 21 Antoine "hashar" Musso (WMF) 2014-10-30 22:14:06 UTC
Thank you for the high-level synopsis; that seems to be a good base for adding graceful/graceful-stop support to the init/upstart script :-)

Pybal (operations/debs/pybal) has different monitors under pybal/monitors. The proxy fetch one uses Twisted to do some HTTP queries. Seems similar.
Comment 22 Chris McMahon 2014-11-05 14:48:57 UTC
*** Bug 73010 has been marked as a duplicate of this bug. ***
Comment 23 Gerrit Notification Bot 2014-11-05 16:07:18 UTC
Change 171277 had a related patch set uploaded by BryanDavis:
Revert "Make scap restart HHVM"

https://gerrit.wikimedia.org/r/171277
Comment 24 Gerrit Notification Bot 2014-11-05 16:12:29 UTC
Change 171277 merged by jenkins-bot:
Revert "Make scap restart HHVM"

https://gerrit.wikimedia.org/r/171277
Comment 25 Bryan Davis 2014-11-05 16:18:15 UTC
I have removed the hhvm restart step from scap for now. I'm pretty sure that we will want to put it back in some form in the long term, but for now it is causing more pain than good. Let's see if this makes the mystery 503s go away.
Comment 26 Antoine "hashar" Musso (WMF) 2014-11-08 15:51:51 UTC
Assuming that no longer restarting HHVM every 10 minutes has solved the issue :)
