Last modified: 2014-05-27 22:58:48 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T65681, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 63681 - beta cluster job runner keep running some periodic tasks
beta cluster job runner keep running some periodic tasks
Status: NEW
Product: Wikimedia Labs
Classification: Unclassified
deployment-prep (beta) (Other open bugs)
unspecified
All All
: Normal normal
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-04-08 15:49 UTC by Antoine "hashar" Musso (WMF)
Modified: 2014-05-27 22:58 UTC (History)
7 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Antoine "hashar" Musso (WMF) 2014-04-08 15:49:00 UTC
The deployment-jobrunner01 instance is quite busy.  Looking at /data/project/logs/runJobs.log it keeps running 31 periodic queue tasks every few seconds for the simplewiki:

2014-04-08 15:44:51 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:44:55 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:03 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:07 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:10 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:16 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:19 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).


I have no clue how to list the tasks. maintenance/showJobs.php does not give any clue nor  getJobQueueLenghts :-/

Looking at the cli.log there is a bunch of:

 simplewiki-c91d295c: 2.0489  22.5M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
 simplewiki-c91d295c: 2.0498  22.5M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
 simplewiki-c91d295c: 2.0507  22.5M  [memcached] result: NOT STORED


Ccing Aaron who probably knows how to fix that :-]
Comment 1 Aaron Schulz 2014-04-08 16:00:29 UTC
Do the cas operations succeed? If they do, maybe memcached has a high eviction rate or something?
Comment 2 Aaron Schulz 2014-04-08 16:01:27 UTC
I don't see this probably locally or on production, so it seems specific to beta.
Comment 3 Antoine "hashar" Musso (WMF) 2014-04-09 09:45:10 UTC
Here is the debug log for session simplewiki-6a538816 from cli.log:

1.7059  14.2M  [caches] main: MemcachedPeclBagOStuff, message: MemcachedPeclBagOStuff, parser: MemcachedPhpBagOStuff
2.1357  18.2M  [caches] LocalisationCache: using store LCStoreCDB
2.9236  22.0M  Fully initialised
3.0402  22.0M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1017  22.5M  Creating a new RedisConnectionPool instance with id 95797b002a3bb33394c101391fbb456c43b076e3.
3.1666  22.8M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1680  22.8M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1688  22.8M  [memcached] result: NOT STORED
3.1690  22.8M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1696  22.8M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1703  22.8M  [memcached] result: NOT STORED
3.1705  22.8M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1712  22.8M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1720  22.8M  [memcached] result: NOT STORED
3.1722  22.8M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1730  22.8M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1744  22.8M  [memcached] result: NOT STORED
3.1748  22.8M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1754  22.8M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1762  22.8M  [memcached] result: NOT STORED
3.1763  22.8M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1769  22.8M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1776  22.8M  [memcached] result: NOT STORED
3.1777  22.8M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1783  22.8M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1790  22.8M  [memcached] result: NOT STORED
3.1791  22.8M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1797  22.8M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1809  22.8M  [memcached] result: NOT STORED
3.1811  22.8M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1822  22.8M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1829  22.8M  [memcached] result: NOT STORED
3.1831  22.8M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1838  22.8M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1848  22.8M  [memcached] result: NOT STORED


Looking at the memcached key  simplewiki:jobqueuegroup:taskruns:v1:

 $ mwscript mcc.php
 > get simplewiki:jobqueuegroup:taskruns:v1
 Getting simplewiki:jobqueuegroup:taskruns:v1[]
 array(31) {
   ["refreshLinks"]=>
   array(1) {
     ["recyclePruneAndUndelayJobs"]=>
     int(1396479307)
   }
 <snip>

I get a 31 elements array, each element having a timestamp of 1396479307 which is Wed, 02 Apr 2014 22:55:07 GMT.


Seems to me there is some logical error in JobQueueGroup::executeReadyPeriodicTasks() which never clear out the previous runs :-/

I am not deleting the key so we can keep investigating.
Comment 4 Aaron Schulz 2014-04-09 21:44:20 UTC
If the add() gave NOT STORED then you should have seen cas(). That sounds like some problem happens with memcached in BagOStuff::mergeViaCas().
Comment 5 Antoine "hashar" Musso (WMF) 2014-04-11 13:10:22 UTC
Do we have any unit test covering that code or anyway to potentially reproduce the issue? I am not familiar at all with the cas stuff, much less with the job queue.
Comment 6 Greg Grossmeier 2014-05-27 22:58:48 UTC
(still seeing those log entries)

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links