Last modified: 2014-09-08 17:42:25 UTC
Neither http://en.wikipedia.beta.wmflabs.org/ nor http://en.wikipedia.beta.wmflabs.org/w/api.php is responding right now. After a long wait, http://en.wikipedia.beta.wmflabs.org/w/api.php returns:

Request: GET http://en.wikipedia.beta.wmflabs.org/w/api.php, from 67.1.150.67 via deployment-cache-text02 frontend ([10.68.16.16]:80), Varnish XID 760419132
Forwarded for: 67.1.150.67
Error: 503, Service Unavailable at Fri, 25 Jul 2014 14:24:01 GMT
Bryan kicked HHVM and things are back (just slow for me). But still investigating.

16:57 < bd808> greg-g: [19:23] < ori> OK, I merged the config change for Labs, so we'll probably know within the next hour or so if we have additional bugs on our hands
16:57 < bblack> hah
16:58 < greg-g> 19:23 what time?
16:58 < bd808> MDT
16:59 < bblack> so that puts it about 1:22 before the last event to udplog
17:00 < bd808> so ... in a hour we might be hosed again?
17:00 < bblack> probably :)
17:02 < bd808> ori: hhvm on both beta servers was borked. ps showed hundreds of zombie sh processes with hhvm as the parent.
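For anyone who wants to check a host for the same symptom, here is a minimal sketch that counts zombie processes per parent PID. It assumes the input is the text output of `ps -eo pid,ppid,stat,comm`. The function name and the sample listing are made up for illustration and are not taken from the beta servers.

```python
from collections import Counter


def count_zombies_by_parent(ps_output: str) -> Counter:
    """Count zombie (state Z) processes per parent PID.

    Expects the text produced by `ps -eo pid,ppid,stat,comm`,
    including its header line.
    """
    counts = Counter()
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # ignore malformed or truncated rows
        _pid, ppid, stat, _comm = fields
        if stat.startswith("Z"):  # "Z" marks a defunct (zombie) process
            counts[int(ppid)] += 1
    return counts


# Hypothetical sample, not real output from the beta cluster.
SAMPLE = """\
  PID  PPID STAT COMMAND
 1200     1 Ssl  hhvm
 1301  1200 Z    sh <defunct>
 1302  1200 Z    sh <defunct>
 1400     1 Ss   apache2
"""

zombies = count_zombies_by_parent(SAMPLE)
```

A parent PID with a large count that maps back to the hhvm process would match what bd808 described.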
The last event seen in logstash was at 2014-07-25T14:45:04.835Z. Ori's IRC message would have been sent around 2014-07-25T01:23Z.
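The timezone conversion above can be double-checked with a short sketch. It assumes the IRC timestamp 19:23 MDT falls on 2014-07-24 local time (implied by the UTC value quoted above) and uses a fixed UTC-6 offset for MDT. The helper name is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: MDT is modelled as a fixed UTC-6 offset.
MDT = timezone(timedelta(hours=-6), "MDT")


def mdt_to_utc(local: datetime) -> datetime:
    """Interpret a naive datetime as MDT and convert it to UTC."""
    return local.replace(tzinfo=MDT).astimezone(timezone.utc)


# Assumption: the message was sent on 2014-07-24 local time.
irc_message = mdt_to_utc(datetime(2014, 7, 24, 19, 23))

# Last logstash event, as quoted in the comment above.
last_logstash_event = datetime(2014, 7, 25, 14, 45, 4, 835000, tzinfo=timezone.utc)

# Elapsed time between the config merge announcement and the last event.
gap = last_logstash_event - irc_message

print(irc_message.isoformat())  # 2014-07-25T01:23:00+00:00
```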
In the Apache error logs I see lots and lots and lots of:

[Fri Jul 25 14:45:59.516788 2014] [proxy_fcgi:error] [pid 17215] (70014)End of file found: [client 10.68.16.12:62752] AH01075: Error dispatching request to :
[Fri Jul 25 14:46:01.058387 2014] [proxy_fcgi:error] [pid 17007] [client 10.68.16.12:62750] AH01067: Failed to read FastCGI header

/var/log/hhvm/error.log lamely doesn't contain timestamps. I didn't see anything obvious in there, however.
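To get a rough idea of how often each error shows up, something like the following sketch could tally proxy_fcgi entries by their AH code. The regular expressions are an assumption based only on the two lines quoted above, not on a full description of the Apache error log format, and the function name is illustrative.

```python
import re
from collections import Counter

# Assumption: lines look like "[timestamp] [module:level] [pid N] rest".
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<module>\w+):(?P<level>\w+)\] "
    r"\[pid (?P<pid>\d+)\] (?P<rest>.*)$"
)
# Apache message codes such as AH01075.
CODE_RE = re.compile(r"\bAH\d{5}\b")


def count_fcgi_errors(lines):
    """Tally proxy_fcgi log entries by their AH message code."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("module") != "proxy_fcgi":
            continue  # skip other modules and unparseable lines
        code = CODE_RE.search(m.group("rest"))
        counts[code.group(0) if code else "unknown"] += 1
    return counts


# The two entries quoted in the comment above.
SAMPLE = [
    "[Fri Jul 25 14:45:59.516788 2014] [proxy_fcgi:error] [pid 17215] "
    "(70014)End of file found: [client 10.68.16.12:62752] "
    "AH01075: Error dispatching request to :",
    "[Fri Jul 25 14:46:01.058387 2014] [proxy_fcgi:error] [pid 17007] "
    "[client 10.68.16.12:62750] AH01067: Failed to read FastCGI header",
]

totals = count_fcgi_errors(SAMPLE)
```

On a real log, passing an open file object instead of `SAMPLE` should work the same way, since the function only iterates over lines.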
*** Bug 68684 has been marked as a duplicate of this bug. ***
So is this an upstream issue resembling https://github.com/facebook/hhvm/issues/2531 ? Or do we (Wikimedia) plan to investigate a workaround/fix ourselves here ("high priority" set)?
We talked about the issue during the RelEng/QA weekly check-in. An engineer from Facebook is in the WMF office for a month, and the HHVM folks are trying to gather as many stack traces and crashes as possible so they are documented for later investigation. A lot of changes are being made to the HHVM code base and configuration to prepare it for production.

In short: beta cluster is going to be unstable for a few :-/

The long-term plan would be to create a new cluster dedicated to running the QA browser tests, updated only once a day or so. That should be more stable. We track that as Bug 65127 - Setup multiversion on Beta Cluster for nightly build browser testing support.
I'm working on this. I'm pretty sure it's a bug in hhvm's fastcgi server.
I think this particular bug has long since been fixed in our internal HHVM builds (and probably upstream by now). Brett, Ori, Giuseppe, can you confirm?
Yeah, this was fixed by https://github.com/facebook/hhvm/commit/60d27550e305d92463469de2b16fc125bae8d79a