Last modified: 2014-09-08 17:42:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T70574, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 68574 - beta labs not responding; API shows 503 from varnish
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta)
Version: unspecified
Hardware: All All
Importance: High major
Target Milestone: ---
Assigned To: Brett Simmers
: hhvm
Duplicates: 68684
Depends on:
Blocks:
Reported: 2014-07-25 14:27 UTC by Chris McMahon
Modified: 2014-09-08 17:42 UTC (History)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description Chris McMahon 2014-07-25 14:27:27 UTC
Neither http://en.wikipedia.beta.wmflabs.org/ nor http://en.wikipedia.beta.wmflabs.org/w/api.php are responding right now. 

After a long wait, http://en.wikipedia.beta.wmflabs.org/w/api.php returns with

 Request: GET http://en.wikipedia.beta.wmflabs.org/w/api.php, from 67.1.150.67 via deployment-cache-text02 frontend ([10.68.16.16]:80), Varnish XID 760419132
Forwarded for: 67.1.150.67
Error: 503, Service Unavailable at Fri, 25 Jul 2014 14:24:01 GMT
Comment 1 Greg Grossmeier 2014-07-25 17:12:50 UTC
Bryan kicked HHVM and things are back (just slow for me). But still investigating.

16:57 <     bd808> greg-g: [19:23]  <      ori>I OK, I merged the config change for Labs, so we'll probably know within the next hour or so if we have additional bugs on our hands
16:57 <    bblack> hah
16:58 <    greg-g> 19:23 what time?
16:58 <     bd808> MDT
16:59 <    bblack> so that puts it about 1:22 before the last event to udplog
17:00 <     bd808> so ... in a hour we might be hosed again?
17:00 <    bblack> probably :)


17:02 <     bd808> ori: hhvm on both beta servers was borked. ps showed hundreds of zombie sh processes with hhvm as the parent.
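
A minimal sketch of how the zombie check described above could be scripted. It assumes the input is the text printed by `ps -eo pid,ppid,stat,comm`; the function name and the sample listing are hypothetical and not taken from the beta servers.

```python
from collections import Counter


def count_zombies_by_parent(ps_output: str) -> Counter:
    """Count zombie (state Z) processes, grouped by the parent's command name.

    Expects the output of `ps -eo pid,ppid,stat,comm`, header line included.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)
        if len(parts) == 4:
            pid, ppid, stat, comm = parts
            rows.append((int(pid), int(ppid), stat, comm.strip()))

    # Map each PID to its command so a zombie's parent can be named.
    comm_by_pid = {pid: comm for pid, _, _, comm in rows}

    counts = Counter()
    for _, ppid, stat, _ in rows:
        if stat.startswith("Z"):
            counts[comm_by_pid.get(ppid, "?")] += 1
    return counts


# Hypothetical sample listing, for illustration only.
SAMPLE = """\
  PID  PPID STAT COMMAND
  100     1 Ssl  hhvm
  101   100 Z    sh <defunct>
  102   100 Z    sh <defunct>
  103     1 Ss   apache2
"""

zombies = count_zombies_by_parent(SAMPLE)
print(zombies)
```

On a live host the same function could be fed `subprocess.check_output(["ps", "-eo", "pid,ppid,stat,comm"], text=True)`.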
Comment 2 Bryan Davis 2014-07-25 17:17:20 UTC
The last event seen in logstash was at 2014-07-25T14:45:04.835Z. Ori's irc message would have been around 2014-07-25T01:23Z.
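
For reference, the conversion behind those two timestamps can be sketched as follows. This assumes MDT is a fixed UTC-6 offset and that the 19:23 message was sent on 2014-07-24 local time, which is inferred from the 01:23Z figure above rather than stated in the IRC log.

```python
from datetime import datetime, timedelta, timezone

# Assumption: MDT treated as a fixed UTC-6 offset; local date 2014-07-24
# is inferred from the 2014-07-25T01:23Z figure in this comment.
MDT = timezone(timedelta(hours=-6), "MDT")

config_merge = datetime(2014, 7, 24, 19, 23, tzinfo=MDT)
last_event = datetime(2014, 7, 25, 14, 45, 4, 835000, tzinfo=timezone.utc)

# Convert the IRC timestamp to UTC and measure the gap to the last event.
merge_utc = config_merge.astimezone(timezone.utc)
gap = last_event - merge_utc

print(merge_utc.isoformat())
print(gap)
```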
Comment 3 Bryan Davis 2014-07-25 17:33:56 UTC
In apache error logs I see lots and lots and lots of:

[Fri Jul 25 14:45:59.516788 2014] [proxy_fcgi:error] [pid 17215] (70014)End of file found: [client 10.68.16.12:62752] AH01075: Error dispatching request to :
[Fri Jul 25 14:46:01.058387 2014] [proxy_fcgi:error] [pid 17007] [client 10.68.16.12:62750] AH01067: Failed to read FastCGI header


/var/log/hhvm/error.log lamely doesn't contain timestamps. I didn't see anything obvious in there however.
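
A rough sketch of how the proxy_fcgi errors above could be tallied by their Apache `AH` code. The regular expression and function name are illustrative assumptions; the two sample lines are the ones quoted in this comment, with the wrapped first line joined.

```python
import re
from collections import Counter

# Matches proxy_fcgi error lines and captures the Apache "AHnnnnn" code.
AH_PATTERN = re.compile(r"\[proxy_fcgi:error\].*?\b(AH\d{5})\b")


def count_fcgi_errors(lines):
    """Return a Counter of AH codes found in proxy_fcgi error log lines."""
    counts = Counter()
    for line in lines:
        match = AH_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE_LINES = [
    "[Fri Jul 25 14:45:59.516788 2014] [proxy_fcgi:error] [pid 17215] "
    "(70014)End of file found: [client 10.68.16.12:62752] AH01075: "
    "Error dispatching request to :",
    "[Fri Jul 25 14:46:01.058387 2014] [proxy_fcgi:error] [pid 17007] "
    "[client 10.68.16.12:62750] AH01067: Failed to read FastCGI header",
]

error_counts = count_fcgi_errors(SAMPLE_LINES)
print(error_counts)
```

To run it against a real log, pass an open file object instead of the sample list, since the function only iterates over lines.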
Comment 4 Tim Landscheidt 2014-07-27 17:09:17 UTC
*** Bug 68684 has been marked as a duplicate of this bug. ***
Comment 5 Andre Klapper 2014-07-30 09:58:58 UTC
So is this an upstream issue resembling https://github.com/facebook/hhvm/issues/2531 ? Or do we (Wikimedia) plan to investigate a workaround/fix ourselves here ("high priority" set)?
Comment 6 Antoine "hashar" Musso (WMF) 2014-07-30 10:13:35 UTC
We talked about the issue during the RelEng/QA weekly check-in. A Facebook engineer is in the WMF office for a month, and the HHVM folks are trying to gather as many stack traces/crashes as possible so they are documented for later investigation. A lot of changes are being made to the HHVM code base or configuration to prepare it for production.

In short: beta cluster is going to be unstable for a few :-/


The long-term plan would be to create a new cluster dedicated to running browser tests for QA, which would be updated only once a day or so. That should be more stable. We track that as Bug 65127 - Setup multiversion on Beta Cluster for nightly build browser testing support.
Comment 7 Brett Simmers 2014-08-02 03:25:50 UTC
I'm working on this. I'm pretty sure it's a bug in hhvm's fastcgi server.
Comment 8 Bryan Davis 2014-09-06 22:18:55 UTC
I think this particular bug is long since fixed in our internal HHVM builds (and probably upstream by now). Brett, Ori, Giuseppe, can you confirm?
Comment 9 Brett Simmers 2014-09-08 17:42:25 UTC
Yeah this was fixed by https://github.com/facebook/hhvm/commit/60d27550e305d92463469de2b16fc125bae8d79a
