Last modified: 2014-08-14 12:20:30 UTC
I logged in to beta labs as User:jdlrobson and made this edit: http://en.wikipedia.beta.wmflabs.org/w/index.php?title=User_talk%3ASelenium_user&diff=119002&oldid=119000 When I log in as Selenium user I do not see a notification for this event.
legoktm@deployment-bastion:~$ mwscript showJobs.php --wiki=enwiki --group
htmlCacheUpdate: 74 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enotifNotify: 41 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cirrusSearchDeletePages: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cirrusSearchLinksUpdatePrioritized: 3104 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
LocalRenameUserJob: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
updateBetaFeaturesUserCounts: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ParsoidCacheUpdateJobOnEdit: 2931 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ParsoidCacheUpdateJobOnDependencyChange: 5676 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
EchoNotificationJob: 1558 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
There is no component for jobrunner yet in bugzilla (bug 68318). Ccing authors Aaron and Ori.
Created attachment 16158 [details]
/etc/jobrunner/jobrunner.conf on deployment-jobrunner01.eqiad.wmflabs

We have a jobrunner for generic jobs: deployment-jobrunner01.eqiad.wmflabs, which has the puppet class role::beta::jobrunner applied. I reran puppet on the instance:

Notice: /Stage[main]/Mediawiki::Jobrunner/Service[jobrunner]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Mediawiki::Jobrunner/Service[jobrunner]: Unscheduling refresh on Service[jobrunner]

But the service does not start:

# service jobrunner status
jobrunner stop/waiting
#

In /var/log/syslog I found:

Aug 8 14:00:52 deployment-jobrunner01 php: PHP Warning: syntax error, unexpected '{' in /etc/jobrunner/jobrunner.conf on line 3#012 in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 128
PHP Fatal error: Uncaught exception 'Exception' with message 'Could not parse file at '/etc/jobrunner/jobrunner.conf'.' in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService:132#012Stack trace:#012#0 /srv/deployment/jobrunner/jobrunner/redisJobRunnerService(51): RedisJobRunnerService::init(Array)#012#1 {main}#012 thrown in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 132

The file /etc/jobrunner/jobrunner.conf is a JSON file managed by puppet and it is invalid:

php > $json = file_get_contents('/etc/jobrunner/jobrunner.conf');
php > var_dump( json_decode( $json ) );
NULL
php >

PHP json_decode() returns NULL if the JSON cannot be decoded or if the encoded data is deeper than the recursion limit. I have attached the file.
The file has inline comments using //, which are not supported by PHP's json_decode(). Removing the comments fixes the issue.
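
For illustration only, here is a minimal Python sketch of one way to tolerate // comments in an otherwise valid JSON file: strip them before handing the text to a strict parser. This is not the jobrunner's PHP code. The helper names strip_line_comments and load_commented_json and the sample content are hypothetical. The scanner tracks string literals so that a "//" inside a value such as a URL is left alone.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line but keep the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_commented_json(text: str):
    """Parse JSON text that may contain // line comments."""
    return json.loads(strip_line_comments(text))


# Hypothetical sample; the real jobrunner.conf is not reproduced here.
sample = '''{
    // hypothetical comment on its own line
    "groups": {
        "basic": { "runners": 0 } // hypothetical trailing comment
    },
    "url": "http://example.org/"
}'''
config = load_commented_json(sample)
```

Passing the same sample straight to json.loads() raises a ValueError, which is the Python analogue of json_decode() returning NULL.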
I believe the new jobrunner service is only used on HHVM. So adding keyword hiphop.
The deployed version of jobrunner in beta had lagged behind the configuration. Additionally, there was a local hotpatch on deployment-jobrunner01 that prevented trebuchet from updating the checkout properly. I fixed these two things and the jobrunner is operational again.

(In reply to Antoine "hashar" Musso from comment #4)
> The file has inline comments using //, which are not supported by PHP's
> json_decode(). Removing the comments fixes the issue.

This is actually handled in the latest version. Aaron strips the comments before parsing the file as JSON.
(In reply to Antoine "hashar" Musso from comment #5)
> I believe the new jobrunner service is only used on HHVM. So adding keyword
> hiphop.

The new jobrunner is actually compatible with both php5 and hhvm. We are running it in production on both interpreters.
Excellent. Thank you very much :]
Thanks for looking into this quickly. I still see the same number of jobs (well, there are more now...) queued though?

htmlCacheUpdate: 81 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enotifNotify: 43 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cirrusSearchDeletePages: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cirrusSearchLinksUpdatePrioritized: 3170 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
LocalRenameUserJob: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
updateBetaFeaturesUserCounts: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ParsoidCacheUpdateJobOnEdit: 2992 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ParsoidCacheUpdateJobOnDependencyChange: 5791 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
EchoNotificationJob: 1640 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
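
Comparing two showJobs.php snapshots by eye is error-prone, so here is a rough Python sketch that parses the "N queued; N claimed (...); N delayed" lines and reports how each queue changed. The regular expression is inferred only from the output pasted in this report, and the helper names parse_show_jobs and queue_growth are hypothetical; they are not part of MediaWiki.

```python
import re

# Pattern inferred from the showJobs.php --group output pasted in this report.
JOB_LINE = re.compile(
    r"(?P<type>\w+): (?P<queued>\d+) queued; (?P<claimed>\d+) claimed "
    r"\((?P<active>\d+) active, (?P<abandoned>\d+) abandoned\); "
    r"(?P<delayed>\d+) delayed"
)


def parse_show_jobs(output: str) -> dict:
    """Map each job type to its queued count."""
    return {m["type"]: int(m["queued"]) for m in JOB_LINE.finditer(output)}


def queue_growth(before: dict, after: dict) -> dict:
    """Queued-count change per job type; positive means the backlog grew."""
    return {job: after[job] - before.get(job, 0) for job in after}


# Two entries from each snapshot quoted in this report.
before = parse_show_jobs(
    "htmlCacheUpdate: 74 queued; 0 claimed (0 active, 0 abandoned); 0 delayed\n"
    "EchoNotificationJob: 1558 queued; 0 claimed (0 active, 0 abandoned); 0 delayed\n"
)
after = parse_show_jobs(
    "htmlCacheUpdate: 81 queued; 0 claimed (0 active, 0 abandoned); 0 delayed\n"
    "EchoNotificationJob: 1640 queued; 0 claimed (0 active, 0 abandoned); 0 delayed\n"
)
growth = queue_growth(before, after)
```

A positive value for every job type, combined with "0 claimed" everywhere, suggests that nothing is consuming the queue.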
It looks like we have a configuration issue. I see `"runners": 0,` for all of the groups in the config file. This is probably a puppet problem.
role::beta::jobrunner has:

class { '::mediawiki::jobrunner':
    aggr_servers  => [ '10.68.16.146' ],
    queue_servers => [ '10.68.16.146' ],
}

And the puppet class ::mediawiki::jobrunner defaults all of its runner settings to 0 :-]
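
A quick sanity check for this kind of misconfiguration could look like the Python sketch below. It assumes the parsed config has a top-level "groups" object whose entries each carry a "runners" count; only the `"runners": 0,` lines are confirmed by this report, so the rest of the layout, the group names, and the helper name groups_without_runners are assumptions made for illustration.

```python
import json


def groups_without_runners(config: dict) -> list:
    """Return the names of groups whose runner count is zero or missing."""
    groups = config.get("groups", {})
    return sorted(
        name
        for name, settings in groups.items()
        if int(settings.get("runners", 0)) == 0
    )


# Hypothetical config; group names and layout are assumed, not copied
# from the real /etc/jobrunner/jobrunner.conf.
sample = json.loads(
    '{"groups": {"basic": {"runners": 0},'
    ' "parsoid": {"runners": 0},'
    ' "upload": {"runners": 2}}}'
)
idle = groups_without_runners(sample)
```

If every group ends up in the returned list, the service can start cleanly and still never pick up a job.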
Change 152931 had a related patch set uploaded by Hashar: beta: Set runners_* for role::beta::jobrunner https://gerrit.wikimedia.org/r/152931
[17:13] < bd808> legoktm: Can you check the job count again?
[17:13] < legoktm> yay it's going down!
[17:13] < legoktm> EchoNotificationJob: 425 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

Patch is in beta via cherry-pick.
Can this be closed now?
deployment-bastion:~$ mwscript showJobs.php --wiki=enwiki --group
cirrusSearchLinksUpdatePrioritized: 0 queued; 3 claimed (0 active, 3 abandoned); 0 delayed
$

I guess it is ok now :)
Change 152931 merged by Ori.livneh: beta: Set runners_* for role::beta::jobrunner https://gerrit.wikimedia.org/r/152931