Last modified: 2014-09-27 02:01:56 UTC
Created attachment 16174: simple script that reproduces the issue in production.

When running on HHVM, the jobrunner service (configured to use fcgi, I suppose) fails to spawn curl requests with the following errors:

[Tue Aug 12 09:25:36 2014] [hphp] [12782:7f0b86813700:0:033896] [] Warning: fork failed - Cannot allocate memory in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 933
[Tue Aug 12 09:25:36 2014] [hphp] [12782:7f0b86813700:0:033897] [] Notice: Undefined index: 1 in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 935
[Tue Aug 12 09:25:36 2014] [hphp] [12782:7f0b86813700:0:033898] [] Notice: Undefined index: 2 in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 936
[Tue Aug 12 09:25:36 2014] [hphp] [12782:7f0b86813700:0:033899] [] Notice: Undefined index: 0 in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 938
[Tue Aug 12 09:25:36 2014] [hphp] [12782:7f0b86813700:0:033900] [] Warning: Not a valid stream resource in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 938
2014-08-12T09:25:36+0000: Could not spawn process in loop 0: curl -XPOST -s -a 'http://127.0.0.1:9002/rpc/RunJobs.php?wiki=glkwiki&type=ChangeNotification&maxtime=60&maxmem=300M'

I tried various tweaks (such as raising the memory limit both in the JR script and in HHVM), but none of them worked around this.

This seems to be a general problem with HHVM as we have it configured, by the way: I wrote a small script that just uses proc_open to fork a curl request for the enwiki main page, and it produces the same error (see attachment).
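The notices that follow the "fork failed" warning (undefined index 1, 2 and 0, then "Not a valid stream resource") look consistent with the caller reading the pipe handles without first checking whether the spawn succeeded. As an illustration only, here is a minimal Python analogue of that check. It is not the jobrunner code and not the attached script; the helper names spawn_with_pipes and run_and_collect are hypothetical.

```python
import subprocess
import sys


def spawn_with_pipes(argv):
    """Start argv with stdin/stdout/stderr pipes; return None if the spawn fails.

    Hypothetical helper for illustration, not part of jobrunner.
    """
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except OSError as exc:
        # A failed fork (for example ENOMEM) and a missing executable
        # both surface here as OSError.
        print(f"Could not spawn process: {exc}", file=sys.stderr)
        return None


def run_and_collect(argv):
    """Return (returncode, stdout, stderr), or None if the child never started."""
    proc = spawn_with_pipes(argv)
    if proc is None:
        # Bail out before touching any pipe handles.
        return None
    out, err = proc.communicate()
    return proc.returncode, out, err
```

Checking the spawn result first would not fix the underlying allocation failure, but it would reduce the log output to a single error per failed spawn instead of the cascade shown above.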
This happens with our packages, of course.
Not seeing this with:

sudo -u apache /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobRunnerService --config-file=/etc/jobrunner/jobrunner.conf --verbose

Also, running some of the curl commands it issues gives normal, expected JSON replies.
22:40 <godog> btw from the issue above there we got the core dumped on mw1053:/tmp via the usual script
22:42 <godog> it looks like this too http://ganglia.wikimedia.org/latest/?r=day&cs=8%2F14%2F2014+5%3A41&ce=8%2F14%2F2014+21%3A13&c=Jobrunners+eqiad&h=mw1053.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS if we don't get any specific clue from the core file on what was going on we could try and disable some job types and see what that does
Fixed by https://github.com/facebook/hhvm/commit/7061ff24162b2afba3738614d9f210cd8fba4a6c