Last modified: 2012-06-08 18:41:55 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T39072, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 37072 - jobrunner trapped in a loop cause of webVideoTranscode job
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta)
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
 
Reported: 2012-05-24 07:34 UTC by Antoine "hashar" Musso (WMF)
Modified: 2012-06-08 18:41 UTC (History)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Antoine "hashar" Musso (WMF) 2012-05-24 07:34:38 UTC
The job-loop repeatedly calls the MediaWiki maintenance/runJobs.php script. For some reason, the processes never end and eat up all the CPU.

The processes look like:

 mwscript runJobs.php --wiki=commonswiki --procs=5 &

In other words, no job type is specified.


The commonswiki job table had two job requests for webVideoTranscode:


(mw@deployment-sql) [commonswiki]> select * from job \G
*************************** 1. row ***************************
       job_id: 1917
      job_cmd: webVideoTranscode
job_namespace: 6
    job_title: Mayday2012-edit-1.ogv
job_timestamp: 20120523195317
   job_params: a:2:{s:13:"transcodeMode";s:10:"derivative";s:12:"transcodeKey";s:8:"160p.ogv";}
*************************** 2. row ***************************
       job_id: 1918
      job_cmd: webVideoTranscode
job_namespace: 6
    job_title: Mayday2012-edit-1.ogv
job_timestamp: 20120523195317
   job_params: a:2:{s:13:"transcodeMode";s:10:"derivative";s:12:"transcodeKey";s:9:"480p.webm";}
2 rows in set (0.00 sec)
(mw@deployment-sql) [commonswiki]>
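
For readers unfamiliar with the job_params column: the values shown above are PHP-serialized arrays. The following is a minimal sketch in Python of how such a flat array of string keys and string values can be decoded. The function name parse_php_string_map is hypothetical and not part of MediaWiki, and the sketch assumes ASCII content (PHP's length prefixes count bytes, not characters).

```python
import re

# Hypothetical helper, not MediaWiki code: decodes only a flat PHP-serialized
# array whose keys and values are all strings, e.g. a:1:{s:1:"k";s:1:"v";}
_STRING_TOKEN = re.compile(r's:(\d+):"')


def parse_php_string_map(serialized):
    """Return a dict for a flat PHP-serialized array of strings."""
    header = re.match(r"a:(\d+):\{", serialized)
    if header is None:
        raise ValueError("not a PHP-serialized array")
    count = int(header.group(1))
    pos = header.end()
    items = []
    # Each array entry is a key token followed by a value token.
    for _ in range(2 * count):
        token = _STRING_TOKEN.match(serialized, pos)
        if token is None:
            raise ValueError("expected a string token at offset %d" % pos)
        length = int(token.group(1))
        start = token.end()
        end = start + length
        # Every string is terminated by a closing quote and a semicolon.
        if serialized[end:end + 2] != '";':
            raise ValueError("malformed string at offset %d" % start)
        items.append(serialized[start:end])
        pos = end + 2
    if serialized[pos:] != "}":
        raise ValueError("unexpected trailing data")
    return dict(zip(items[0::2], items[1::2]))


# job_params value of job 1917, copied from the query output above.
params = parse_php_string_map(
    'a:2:{s:13:"transcodeMode";s:10:"derivative";'
    's:12:"transcodeKey";s:8:"160p.ogv";}'
)
print(params)
```

For job 1917 this yields transcodeMode set to derivative and transcodeKey set to 160p.ogv, so the two rows are requests for two different derivatives of the same file.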


So it seems the runJobs.php script keeps looping forever trying to complete the jobs.

Deleting the jobs solves the looping issue:

(mw@deployment-sql) [commonswiki]> delete from job;
Query OK, 2 rows affected (0.38 sec)
Comment 1 Antoine "hashar" Musso (WMF) 2012-05-24 07:36:27 UTC
To find out:
- why the triggered jobs never end up running
- why, despite there being only 2 jobs, there are several forked processes
- why the jobs stick in the queue
Comment 2 Antoine "hashar" Musso (WMF) 2012-05-24 07:43:40 UTC
Looks like job::pop() fails to delete the jobs from the database :-(
Comment 3 Antoine "hashar" Musso (WMF) 2012-05-28 08:27:07 UTC
I found the root cause while sleeping this weekend.

The cause is that transcode jobs are excluded from being processed by runJobs.php (through the use of $wgJobTypesExcludedFromDefaultQueue), whereas nextJobDB.php still considers those jobs to be in need of processing. The end result is an infinite loop, since the jobs are never processed.


Hence the addition of $wgJobTypesExcludedFromDefaultQueue, by commit 45f9da8ad7, needs to be enhanced.
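
The mismatch described in this comment can be illustrated with a toy model. This is a sketch in Python, not MediaWiki code: every function and variable name is hypothetical, the loop is capped at a few rounds so it terminates, and the "fixed" variant only assumes that the dispatcher is made to ignore the same job types as the runner, which may differ from what the actual patches do.

```python
# Toy model of the loop; all names are hypothetical, not MediaWiki APIs.

# Plays the role of $wgJobTypesExcludedFromDefaultQueue.
EXCLUDED_FROM_DEFAULT_QUEUE = frozenset({"webVideoTranscode"})


def wiki_has_pending_jobs(queue, excluded=frozenset()):
    """Dispatcher side (the role of nextJobDB.php): does a runner need to start?"""
    return any(job["cmd"] not in excluded for job in queue)


def run_default_jobs(queue, excluded):
    """Runner side (the role of runJobs.php without a type): skip excluded types."""
    remaining = [job for job in queue if job["cmd"] in excluded]
    processed = len(queue) - len(remaining)
    queue[:] = remaining  # excluded jobs are left in the queue untouched
    return processed


def job_loop(queue, dispatcher_excluded, max_rounds=5):
    """Count how many times a runner is launched, capped at max_rounds."""
    rounds = 0
    while rounds < max_rounds and wiki_has_pending_jobs(queue, dispatcher_excluded):
        run_default_jobs(queue, EXCLUDED_FROM_DEFAULT_QUEUE)
        rounds += 1
    return rounds


def make_queue():
    """The two queued jobs from the description of this bug."""
    return [
        {"id": 1917, "cmd": "webVideoTranscode"},
        {"id": 1918, "cmd": "webVideoTranscode"},
    ]


# Dispatcher counts every job type: runners keep being launched for nothing.
buggy_rounds = job_loop(make_queue(), dispatcher_excluded=frozenset())

# Dispatcher ignores the excluded types too: no runner is launched.
fixed_rounds = job_loop(make_queue(), dispatcher_excluded=EXCLUDED_FROM_DEFAULT_QUEUE)

print(buggy_rounds, fixed_rounds)
```

In the first call the runner is launched on every round until the cap is hit, because the dispatcher keeps seeing jobs that the runner refuses to process. In the second call the loop never starts.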
Comment 4 Antoine "hashar" Musso (WMF) 2012-05-28 08:33:18 UTC
Raising priority as a reminder to get this reviewed ASAP. It causes disruptions on deployment-prep.

Patch to MW Core:
https://gerrit.wikimedia.org/r/9116
Comment 5 Antoine "hashar" Musso (WMF) 2012-06-08 12:34:44 UTC
Gerrit change #9116, which fixed nextJobDB.php, has been merged. A similar issue occurs in runJobs.php, which can also lead to an infinite loop. The proposed change is:

https://gerrit.wikimedia.org/r/10692
Comment 6 Antoine "hashar" Musso (WMF) 2012-06-08 18:41:55 UTC
Both patches are merged. I have applied them to the beta cluster and the infinite loop no longer occurs.
