Last modified: 2014-09-08 17:42:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T70574, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 68574 - beta labs not responding; API shows 503 from varnish


Summary:	beta labs not responding; API shows 503 from varnish

Status:	RESOLVED FIXED

Product:	Wikimedia Labs
Classification:	Unclassified
Component:	deployment-prep (beta)
Version:	unspecified
Hardware:	All All

Importance:	High major
Target Milestone:	---
Assigned To:	Brett Simmers

URL:
Whiteboard:
Keywords:	hhvm

Duplicates:	68684
Depends on:
Blocks:

Reported:	2014-07-25 14:27 UTC by Chris McMahon
Modified:	2014-09-08 17:42 UTC
CC List:	14 users

See Also:	https://github.com/facebook/hhvm/issues/2531
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Chris McMahon 2014-07-25 14:27:27 UTC

Neither http://en.wikipedia.beta.wmflabs.org/ nor http://en.wikipedia.beta.wmflabs.org/w/api.php are responding right now. 

After a long wait, http://en.wikipedia.beta.wmflabs.org/w/api.php returns with

 Request: GET http://en.wikipedia.beta.wmflabs.org/w/api.php, from 67.1.150.67 via deployment-cache-text02 frontend ([10.68.16.16]:80), Varnish XID 760419132
Forwarded for: 67.1.150.67
Error: 503, Service Unavailable at Fri, 25 Jul 2014 14:24:01 GMT

Comment 1 Greg Grossmeier 2014-07-25 17:12:50 UTC

Bryan kicked HHVM and things are back (just slow for me). But still investigating.

16:57 <     bd808> greg-g: [19:23]  <      ori>I OK, I merged the config change for Labs, so we'll probably know within the next hour or so if we have additional bugs on our hands
16:57 <    bblack> hah
16:58 <    greg-g> 19:23 what time?
16:58 <     bd808> MDT
16:59 <    bblack> so that puts it about 1:22 before the last event to udplog
17:00 <     bd808> so ... in a hour we might be hosed again?
17:00 <    bblack> probably :)


17:02 <     bd808> ori: hhvm on both beta servers was borked. ps showed hundreds of zombie sh processes with hhvm as the parent.
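A minimal sketch of how the zombie count described above could be reproduced from `ps` output. It assumes the text comes from something like `ps -eo pid,ppid,stat,comm` on Linux; the function name and the sample PIDs are hypothetical and are not taken from the beta servers.

```python
from collections import Counter


def count_zombies_by_parent(ps_output, command="sh"):
    """Count zombie processes named `command`, grouped by parent PID."""
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, comm = fields
        # A process state starting with "Z" marks a zombie (defunct) process.
        if stat.startswith("Z") and comm.split()[0] == command:
            counts[int(ppid)] += 1
    return counts


# Hypothetical sample data, for illustration only.
SAMPLE = """\
  PID  PPID STAT COMMAND
 1200     1 Ssl  hhvm
 1301  1200 Z    sh <defunct>
 1302  1200 Z    sh <defunct>
 1400     1 Ss   apache2
"""

zombies = count_zombies_by_parent(SAMPLE)
print(zombies)
```

A parent PID with a large count here would point at the process that is not reaping its children.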

Comment 2 Bryan Davis 2014-07-25 17:17:20 UTC

The last event seen in logstash was at 2014-07-25T14:45:04.835Z. Ori's irc message would have been around 2014-07-25T01:23Z.
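The timezone arithmetic above can be checked with a short sketch. It assumes the 19:23 MDT message was sent on the evening of 2014-07-24 and treats MDT as a fixed UTC-6 offset; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# MDT is UTC-6; a fixed offset avoids depending on a tz database.
MDT = timezone(timedelta(hours=-6), "MDT")

# Assumption: the 19:23 MDT message was sent on the evening of 2014-07-24.
irc_message = datetime(2014, 7, 24, 19, 23, tzinfo=MDT)
irc_message_utc = irc_message.astimezone(timezone.utc)

# Last logstash event, as quoted in this comment.
last_event = datetime(2014, 7, 25, 14, 45, 4, 835000, tzinfo=timezone.utc)

# Time between the config merge announcement and the last logged event.
gap = last_event - irc_message_utc

print(irc_message_utc.strftime("%Y-%m-%dT%H:%MZ"))  # 2014-07-25T01:23Z
print(gap)  # 13:22:04.835000
```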

Comment 3 Bryan Davis 2014-07-25 17:33:56 UTC

In apache error logs I see lots and lots and lots of:

[Fri Jul 25 14:45:59.516788 2014] [proxy_fcgi:error] [pid 17215] (70014)End of file found: [client 10.68.16.12:62752] AH01075: Error dispatching request to :
[Fri Jul 25 14:46:01.058387 2014] [proxy_fcgi:error] [pid 17007] [client 10.68.16.12:62750] AH01067: Failed to read FastCGI header


/var/log/hhvm/error.log lamely doesn't contain timestamps. I didn't see anything obvious in there however.
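To put a number on "lots and lots", a rough sketch like the following could tally the Apache error lines by their AH code. The regular expressions are an assumption based only on the two lines quoted above, not a complete parser for the Apache error log format, and the timestamp parsing assumes an English locale.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the leading "[timestamp] [module:level] [pid N]" fields.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<module>[\w.]+):(?P<level>\w+)\] \[pid (?P<pid>\d+)\]"
)
# Apache 2.4 tags messages with codes such as AH01075.
AH_RE = re.compile(r"\bAH\d{5}\b")


def parse_line(line):
    """Return a dict of fields for one error log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    code = AH_RE.search(line)
    return {
        "time": datetime.strptime(m["ts"], "%a %b %d %H:%M:%S.%f %Y"),
        "module": m["module"],
        "level": m["level"],
        "pid": int(m["pid"]),
        "code": code.group(0) if code else None,
    }


# The two lines quoted in this comment.
LOG_LINES = [
    "[Fri Jul 25 14:45:59.516788 2014] [proxy_fcgi:error] [pid 17215] "
    "(70014)End of file found: [client 10.68.16.12:62752] "
    "AH01075: Error dispatching request to :",
    "[Fri Jul 25 14:46:01.058387 2014] [proxy_fcgi:error] [pid 17007] "
    "[client 10.68.16.12:62750] AH01067: Failed to read FastCGI header",
]

parsed = [p for p in map(parse_line, LOG_LINES) if p is not None]
codes = Counter(p["code"] for p in parsed)
print(codes)
```

Run over the full log, the per-code counts and the earliest timestamp would show when the FastCGI errors started.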

Comment 4 Tim Landscheidt 2014-07-27 17:09:17 UTC

*** Bug 68684 has been marked as a duplicate of this bug. ***

Comment 5 Andre Klapper 2014-07-30 09:58:58 UTC

So is this an upstream issue resembling https://github.com/facebook/hhvm/issues/2531 ? Or do we (Wikimedia) plan to investigate a workaround/fix ourselves here ("high priority" set)?

Comment 6 Antoine "hashar" Musso (WMF) 2014-07-30 10:13:35 UTC

We talked about the issue during the RelEng/QA weekly check-in. An engineer from Facebook is in the WMF office for a month, and the HHVM folks are trying to gather as many stack traces and crashes as possible so they are documented for later investigation. A lot of changes are being made to the HHVM code base or configuration to finally prepare it for production.

In short: beta cluster is going to be unstable for a few :-/


The long-term plan would be to create a new cluster dedicated to running QA browser tests, which would be updated only once a day or so. That should be more stable. We track that as Bug 65127 - Setup multiversion on Beta Cluster for nightly build browser testing support.

Comment 7 Brett Simmers 2014-08-02 03:25:50 UTC

I'm working on this. I'm pretty sure it's a bug in hhvm's fastcgi server.

Comment 8 Bryan Davis 2014-09-06 22:18:55 UTC

I think this particular bug is long since fixed in our internal HHVM builds (and probably in the upstream by now). Brett, Ori, Giuseppe, can you confirm?

Comment 9 Brett Simmers 2014-09-08 17:42:25 UTC

Yeah this was fixed by https://github.com/facebook/hhvm/commit/60d27550e305d92463469de2b16fc125bae8d79a
