Last modified: 2014-10-31 23:57:41 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T73853, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 71853 - Retry counts not working


Summary:	Retry counts not working

Status:	NEW

Product:	MediaWiki
Classification:	Unclassified
Component:	JobQueue (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal normal (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2014-10-08 22:19 UTC by Gabriel Wicke
Modified:	2014-10-31 23:57 UTC (History)
CC List:	2 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Gabriel Wicke 2014-10-08 22:19:51 UTC

The job retry counters don't seem to be working, at least for ParsoidCache* jobs:


gwicke@tin:~$ mwscript showJobs.php --wiki=enwiki ---type=ParsoidCacheUpdateJobOnDependencyChange --list | grep ParsoidCache | grep -c '=201401'
86

gwicke@fluorine:/a/mw-log$ grep ParsoidCache runJobs.log | grep -c '=201401'
4006


Normally I'd expect those jobs to be retried three times only (default setting for 'attempts' in puppet/modules/mediawiki/templates/jobrunner/jobrunner.conf.erb                ). Instead, these jobs have been around since January. The 86 jobs have been executed 4006 times today alone, which works out to more than three retries per job.

Example runJobs.log line:

2014-10-08 08:35:49 mw1007 enwiki: ParsoidCacheUpdateJobOnDependencyChange Module:Message_box/configuration pages=array(12) rootJobSignature=dd78b0f638bd2f608d82c1e31574b46271060622 rootJobTimestamp=20140127010733 table=templatelinks type=OnDependencyChange t=14415 good

t=14415 would be over the default timeout of 3600.

Comment 1 Aaron Schulz 2014-10-08 23:59:22 UTC

I don't get this math at all. You are comparing current pending jobs with a root timestamp prefix to all of those run today(ish) with that root timestamp prefix.

A slightly more interesting query is:

aaron@fluorine:~/mw-log$ head -n 100000 runJobs.log | grep -Po '\w+: ParsoidCacheUpdateJobOnDependencyChange .+STARTING' | sort | uniq -c | sort -k1 -n -r | head -n 20
    310 svwiki: ParsoidCacheUpdateJobOnDependencyChange Mall:Infobox_biografi pages=array(12) rootJobSignature=a5adce8cb4b7f6429c63373cf750eb7fa58d4952 rootJobTimestamp=20141008081240 table=templatelinks type=OnDependencyChange STARTING
    232 enwiktionary: ParsoidCacheUpdateJobOnDependencyChange Module:utilities pages=array(12) rootJobSignature=06aaf41a1bd13c4ebe635e76eb45193d8b9fb46a rootJobTimestamp=20140612102850 table=templatelinks type=OnDependencyChange STARTING
    164 enwiktionary: ParsoidCacheUpdateJobOnDependencyChange Module:languages/data2 pages=array(12) rootJobSignature=e458a770b9fdf5efb61fb51efd18f26bacc7780f rootJobTimestamp=20140908024938 table=templatelinks type=OnDependencyChange STARTING

This still useless since pages=array(12) does not imply that it was the *same* 12 pages. That's the expected results of the partitioning scheme. The query would work if each job had a hash of the params printed to the log though...

Comment 2 Gabriel Wicke 2014-10-14 19:09:35 UTC

(In reply to Aaron Schulz from comment #1)
> I don't get this math at all. You are comparing current pending jobs with a
> root timestamp prefix to all of those run today(ish) with that root
> timestamp prefix.

The point is that all those are retries, and that there are many more retries of those very old jobs than we'd expect from the default retry limit of 3. And this is by looking on the logs of a single day only. Those jobs have been retried many more times since January.

Comment 3 Aaron Schulz 2014-10-14 22:01:47 UTC

Looking in Redis I also don't see any jobs tried more than thrice now:

> var_dump( $conn->hGetAll( wfMemcKey( 'jobqueue', $type, 'h-attempts' ) ), $prop );
array(4) {
  ["dd3198047ef64b6c8d6b9bc4d9455f72"]=>
  string(1) "3"
  ["cfe7372acf9c412da3fc30f58bc16073"]=>
  string(1) "3"
  ["79df259a1a3b4227bceab760c11cc64d"]=>
  string(1) "1"
  ["8166de1e7c3140aeb4fbfb9b31ce87bb"]=>
  string(1) "1"
}


As with a few weeks before, the abandoned (and therefore recently retried) job count is very low with that queue.

Actually looking at the code long enough made me notice a somewhat opposite bug (e.g. not enough retries, fixed in https://gerrit.wikimedia.org/r/#/c/166681/ ).

Comment 4 Gabriel Wicke 2014-10-16 14:19:37 UTC

Just re-checked on tin:

gwicke@tin:~$ mwscript showJobs.php --wiki=enwiki ---type=ParsoidCacheUpdateJobOnDependencyChange --list | grep ParsoidCache | grep -c '=201401'
86

So all of those jobs are still in the queue, despite being retried many times since I opened this bug (I count 3250 executions in today's log).

So perhaps the retry counter is not incremented properly for jobs that succeed after the timeout has elapsed?

Comment 5 Gabriel Wicke 2014-10-31 00:50:16 UTC

@Aaron: Should we just clear the old jobs by temporarily merging https://gerrit.wikimedia.org/r/#/c/165635/ ? If it's still a problem with new jobs then we'll find new ones to investigate soon. If it's actually fixed & some freak corruption issue, then we might avoid spending too much time on the investigation.

Comment 6 Aaron Schulz 2014-10-31 23:57:41 UTC

It doesn't seem to be the redis job queue retry system. Maybe it's the partitioning code in the Parsoid extension?

aaron@fluorine:~/mw-log$ head -n 10000000 runJobs.log | grep -Po '\w+: ParsoidCacheUpdateJobOnDependencyChange .+STARTING' | grep -Po '\(uuid=\w+,' | sort | uniq -c | sort -k1 -n -r | head -n 20
      1 (uuid=ffffe377d4694ceda951d80a215ffb77,
      1 (uuid=ffffd36e7f654c4397d76b344a0cf58e,
      1 (uuid=ffffacd744294446909b57e1882b2d87,
      1 (uuid=ffff831d0d2f47a8833487fb3ccfaa6f,
      1 (uuid=ffff7c6b2df845328b6a6e5545dfdfd2,
      1 (uuid=fffef1cd5a3b4ef69ae06cf2056b3e9c,
      1 (uuid=fffed88560724c3a9daa4a1ae8e28a15,
      1 (uuid=fffeb2b7138946298e2871bcf56697c4,
      1 (uuid=fffe9e3930cb49419595dabcdd9e873f,
      1 (uuid=fffe961daccb4654acacf6eb624f9560,
      1 (uuid=fffe93496ad947e0a98e9caabe7ef117,
      1 (uuid=fffe8ae575884c54ba6bd134750e2dd9,
      1 (uuid=fffe29d2643542b5828c1707abb6fc76,
      1 (uuid=fffe0d75e1b149c98e78c14a24ee6cf3,
      1 (uuid=fffdefdfa4764fc7a3ad041f9177893a,
      1 (uuid=fffdeddcfd664ed08f8da0043df51d35,
      1 (uuid=fffdbbe0e9454adfb08a9f3749b2361f,
      1 (uuid=fffd9f3395084cde91865dbe41d7db14,
      1 (uuid=fffd90b99c1c40a095be8d6d66aaa5f1,
      1 (uuid=fffd850a6b344fffaf904fc1a5df567c,

aaron@fluorine:~/mw-log$ grep -Po '\w+: refreshLinks .+STARTING' runJobs.log | grep -Po '\(uuid=\w+,' | sort | uniq -c | sort -k1 -n -r | head -n 20
      2 (uuid=a5cabb857ffd4953a4484d1a9c75d91c,
      2 (uuid=67be1fc2aec44fc9821092478fbefcbb,
      1 (uuid=fffff607cfd2421384a2addb06fdf67d,
      1 (uuid=ffffeff90821463f8b006e5358dd6a67,
      1 (uuid=ffffef61e5554ecfb5ff5b002b3788c0,
      1 (uuid=ffffeee9ec4942cf9a1e46a7759da168,
      1 (uuid=ffffec525411482d892480d6bc9bdefa,
      1 (uuid=ffffe9cbf6c04bfdbd7dbb39b134afdf,
      1 (uuid=ffffe7522cd545f6b201b6f38ba5fe13,
      1 (uuid=ffffe6c011414031b88d7e735728e3c3,
      1 (uuid=ffffe51c1d8a4cc4bbead6e5bb285e94,
      1 (uuid=ffffe44f1c2e48d7ae0b1a9aec87eca4,
      1 (uuid=ffffe02ca1c443da9ff2fe1485eeec0f,
      1 (uuid=ffffde5755014d918c98818a2a937f34,
      1 (uuid=ffffd6897a01417fbb996918c8fdbdc8,
      1 (uuid=ffffd2fde873473297b4f194e3791508,
      1 (uuid=ffffd17e8a2f40de85940d24000bbea7,
      1 (uuid=ffffc1bec5fa44ed841ba2cf4811d488,
      1 (uuid=ffffc187ad06449f9a5b0b862f3b7eca,
      1 (uuid=ffffb30fd1b54aa494c09ccc8d599d30,

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links