Last modified: 2014-10-31 23:57:41 UTC
The job retry counters don't seem to be working, at least for ParsoidCache* jobs: gwicke@tin:~$ mwscript showJobs.php --wiki=enwiki ---type=ParsoidCacheUpdateJobOnDependencyChange --list | grep ParsoidCache | grep -c '=201401' 86 gwicke@fluorine:/a/mw-log$ grep ParsoidCache runJobs.log | grep -c '=201401' 4006 Normally I'd expect those jobs to be retried three times only (default setting for 'attempts' in puppet/modules/mediawiki/templates/jobrunner/jobrunner.conf.erb ). Instead, these jobs have been around since January. The 86 jobs have been executed 4006 times today alone, which works out to more than three retries per job. Example runJobs.log line: 2014-10-08 08:35:49 mw1007 enwiki: ParsoidCacheUpdateJobOnDependencyChange Module:Message_box/configuration pages=array(12) rootJobSignature=dd78b0f638bd2f608d82c1e31574b46271060622 rootJobTimestamp=20140127010733 table=templatelinks type=OnDependencyChange t=14415 good t=14415 would be over the default timeout of 3600.
I don't get this math at all. You are comparing current pending jobs with a root timestamp prefix to all of those run today(ish) with that root timestamp prefix. A slightly more interesting query is: aaron@fluorine:~/mw-log$ head -n 100000 runJobs.log | grep -Po '\w+: ParsoidCacheUpdateJobOnDependencyChange .+STARTING' | sort | uniq -c | sort -k1 -n -r | head -n 20 310 svwiki: ParsoidCacheUpdateJobOnDependencyChange Mall:Infobox_biografi pages=array(12) rootJobSignature=a5adce8cb4b7f6429c63373cf750eb7fa58d4952 rootJobTimestamp=20141008081240 table=templatelinks type=OnDependencyChange STARTING 232 enwiktionary: ParsoidCacheUpdateJobOnDependencyChange Module:utilities pages=array(12) rootJobSignature=06aaf41a1bd13c4ebe635e76eb45193d8b9fb46a rootJobTimestamp=20140612102850 table=templatelinks type=OnDependencyChange STARTING 164 enwiktionary: ParsoidCacheUpdateJobOnDependencyChange Module:languages/data2 pages=array(12) rootJobSignature=e458a770b9fdf5efb61fb51efd18f26bacc7780f rootJobTimestamp=20140908024938 table=templatelinks type=OnDependencyChange STARTING This still useless since pages=array(12) does not imply that it was the *same* 12 pages. That's the expected results of the partitioning scheme. The query would work if each job had a hash of the params printed to the log though...
(In reply to Aaron Schulz from comment #1) > I don't get this math at all. You are comparing current pending jobs with a > root timestamp prefix to all of those run today(ish) with that root > timestamp prefix. The point is that all those are retries, and that there are many more retries of those very old jobs than we'd expect from the default retry limit of 3. And this is by looking on the logs of a single day only. Those jobs have been retried many more times since January.
Looking in Redis I also don't see any jobs tried more than thrice now: > var_dump( $conn->hGetAll( wfMemcKey( 'jobqueue', $type, 'h-attempts' ) ), $prop ); array(4) { ["dd3198047ef64b6c8d6b9bc4d9455f72"]=> string(1) "3" ["cfe7372acf9c412da3fc30f58bc16073"]=> string(1) "3" ["79df259a1a3b4227bceab760c11cc64d"]=> string(1) "1" ["8166de1e7c3140aeb4fbfb9b31ce87bb"]=> string(1) "1" } As with a few weeks before, the abandoned (and therefore recently retried) job count is very low with that queue. Actually looking at the code long enough made me notice a somewhat opposite bug (e.g. not enough retries, fixed in https://gerrit.wikimedia.org/r/#/c/166681/ ).
Just re-checked on tin: gwicke@tin:~$ mwscript showJobs.php --wiki=enwiki ---type=ParsoidCacheUpdateJobOnDependencyChange --list | grep ParsoidCache | grep -c '=201401' 86 So all of those jobs are still in the queue, despite being retried many times since I opened this bug (I count 3250 executions in today's log). So perhaps the retry counter is not incremented properly for jobs that succeed after the timeout has elapsed?
@Aaron: Should we just clear the old jobs by temporarily merging https://gerrit.wikimedia.org/r/#/c/165635/ ? If it's still a problem with new jobs then we'll find new ones to investigate soon. If it's actually fixed & some freak corruption issue, then we might avoid spending too much time on the investigation.
It doesn't seem to be the redis job queue retry system. Maybe it's the partitioning code in the Parsoid extension? aaron@fluorine:~/mw-log$ head -n 10000000 runJobs.log | grep -Po '\w+: ParsoidCacheUpdateJobOnDependencyChange .+STARTING' | grep -Po '\(uuid=\w+,' | sort | uniq -c | sort -k1 -n -r | head -n 20 1 (uuid=ffffe377d4694ceda951d80a215ffb77, 1 (uuid=ffffd36e7f654c4397d76b344a0cf58e, 1 (uuid=ffffacd744294446909b57e1882b2d87, 1 (uuid=ffff831d0d2f47a8833487fb3ccfaa6f, 1 (uuid=ffff7c6b2df845328b6a6e5545dfdfd2, 1 (uuid=fffef1cd5a3b4ef69ae06cf2056b3e9c, 1 (uuid=fffed88560724c3a9daa4a1ae8e28a15, 1 (uuid=fffeb2b7138946298e2871bcf56697c4, 1 (uuid=fffe9e3930cb49419595dabcdd9e873f, 1 (uuid=fffe961daccb4654acacf6eb624f9560, 1 (uuid=fffe93496ad947e0a98e9caabe7ef117, 1 (uuid=fffe8ae575884c54ba6bd134750e2dd9, 1 (uuid=fffe29d2643542b5828c1707abb6fc76, 1 (uuid=fffe0d75e1b149c98e78c14a24ee6cf3, 1 (uuid=fffdefdfa4764fc7a3ad041f9177893a, 1 (uuid=fffdeddcfd664ed08f8da0043df51d35, 1 (uuid=fffdbbe0e9454adfb08a9f3749b2361f, 1 (uuid=fffd9f3395084cde91865dbe41d7db14, 1 (uuid=fffd90b99c1c40a095be8d6d66aaa5f1, 1 (uuid=fffd850a6b344fffaf904fc1a5df567c, aaron@fluorine:~/mw-log$ grep -Po '\w+: refreshLinks .+STARTING' runJobs.log | grep -Po '\(uuid=\w+,' | sort | uniq -c | sort -k1 -n -r | head -n 20 2 (uuid=a5cabb857ffd4953a4484d1a9c75d91c, 2 (uuid=67be1fc2aec44fc9821092478fbefcbb, 1 (uuid=fffff607cfd2421384a2addb06fdf67d, 1 (uuid=ffffeff90821463f8b006e5358dd6a67, 1 (uuid=ffffef61e5554ecfb5ff5b002b3788c0, 1 (uuid=ffffeee9ec4942cf9a1e46a7759da168, 1 (uuid=ffffec525411482d892480d6bc9bdefa, 1 (uuid=ffffe9cbf6c04bfdbd7dbb39b134afdf, 1 (uuid=ffffe7522cd545f6b201b6f38ba5fe13, 1 (uuid=ffffe6c011414031b88d7e735728e3c3, 1 (uuid=ffffe51c1d8a4cc4bbead6e5bb285e94, 1 (uuid=ffffe44f1c2e48d7ae0b1a9aec87eca4, 1 (uuid=ffffe02ca1c443da9ff2fe1485eeec0f, 1 (uuid=ffffde5755014d918c98818a2a937f34, 1 (uuid=ffffd6897a01417fbb996918c8fdbdc8, 1 (uuid=ffffd2fde873473297b4f194e3791508, 1 (uuid=ffffd17e8a2f40de85940d24000bbea7, 1 (uuid=ffffc1bec5fa44ed841ba2cf4811d488, 1 (uuid=ffffc187ad06449f9a5b0b862f3b7eca, 1 (uuid=ffffb30fd1b54aa494c09ccc8d599d30,