Last modified: 2014-10-27 20:05:37 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T62862, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 60862 - Tool Labs: jsub should prevent starting duplicate jobs for -once tasks
Status: NEW
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Marc A. Pelletier
Duplicates: 69867
Depends on:
Blocks:
Reported: 2014-02-05 05:35 UTC by Liangent
Modified: 2014-10-27 20:05 UTC

Description Liangent 2014-02-05 05:35:25 UTC
In bug 53629, Coren mentioned: "there is no locking, so if you jstart twice within a very short period of time (a few seconds) both invocations would see none running and start" and "Locking would be a reasonable added safeguard, and even when cron gets replaced it would remain useful, but has a few implementation gotchas that will be tricky to get right. Nevertheless, having a bug for it would not be a bad thing."
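
The race described here is a check-then-act sequence without mutual exclusion. A minimal Python sketch of the locking idea follows, assuming an advisory file lock is acceptable. `start_once`, `is_running`, `submit` and the lock file location are hypothetical names for illustration and are not part of jsub or jstart.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def start_once(name, is_running, submit, lock_dir="/tmp"):
    """Run the check and the submit as one critical section.

    `is_running` and `submit` are caller-supplied callables (hypothetical
    stand-ins for "is a job with this name queued or running?" and
    "submit the job"). Returns True if a job was submitted.
    """
    lock_path = os.path.join(lock_dir, "once-%s.lock" % name)
    with exclusive_lock(lock_path):
        if is_running(name):
            return False
        submit(name)
        return True
```

With this shape, a second caller waits for the lock, then sees the job submitted by the first caller and returns without starting another. Whether such a lock is honoured across submit hosts depends on the file system holding the lock file, which may be one of the gotchas mentioned in the quote.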
Comment 1 Tim Landscheidt 2014-08-21 18:44:09 UTC
*** Bug 69867 has been marked as a duplicate of this bug. ***
Comment 2 Krinkle 2014-08-21 19:28:55 UTC
Hm.. if bug 69867 is a dupe of this, how come it was jstarted twice in such a short period of time? It runs every 2 or 5 minutes. Surely 'jstart' doesn't take that long to run?
Comment 3 Tim Landscheidt 2014-08-21 19:39:54 UTC
I assume (= my reasoning for flagging bug #69867 as a dupe) that network congestion and/or load on the client, the SGE server or the DNS server can cause jstart to take that "long" to run (or to fail at the "check if job is running" phase).
Comment 4 Krinkle 2014-08-28 13:05:24 UTC
Happened again at tools.wmfdbbbot:

qstat:
3343394 0.31914 dbbot-wm   tools.wmfdbb r     08/21/2014 18:40:18 continuous@tools-exec-13.eqiad     1        
3488693 0.30383 dbbot-wm   tools.wmfdbb r     08/27/2014 04:35:23 continuous@tools-exec-10.eqiad     1        
3505161 0.30205 dbbot-wm   tools.wmfdbb r     08/27/2014 19:40:27 continuous@tools-exec-06.eqiad     1        

crontab:
*/5 * * * * /usr/bin/jsub -N dbbot-wm -once -continuous -quiet -mem 1280M -o /dev/null php ~/apps/ts-krinkle-Kribo/Init.php
Comment 5 Krinkle 2014-08-28 17:29:01 UTC
And tools.ecmabot:

3343333 0.31965 ecmabot-wm tools.ecmabo r     08/21/2014 18:36:18 continuous@tools-exec-09.eqiad     1        
3371234 0.31670 ecmabot-wm tools.ecmabo r     08/22/2014 19:36:27 continuous@tools-exec-08.eqiad     1        
3376532 0.31614 ecmabot-wm tools.ecmabo r     08/23/2014 00:22:27 continuous@tools-exec-10.eqiad     1        
3383752 0.31536 ecmabot-wm tools.ecmabo r     08/23/2014 07:00:27 continuous@tools-exec-13.eqiad     1        
3450460 0.30830 ecmabot-wm tools.ecmabo r     08/25/2014 18:56:16 continuous@tools-exec-10.eqiad     1        
3467240 0.30656 ecmabot-wm tools.ecmabo r     08/26/2014 09:44:23 continuous@tools-exec-11.eqiad     1        
3477929 0.30546 ecmabot-wm tools.ecmabo r     08/26/2014 19:06:23 continuous@tools-exec-01.eqiad     1        
3484066 0.30482 ecmabot-wm tools.ecmabo r     08/27/2014 00:34:23 continuous@tools-exec-11.eqiad     1        
3484971 0.30472 ecmabot-wm tools.ecmabo r     08/27/2014 01:20:23 continuous@tools-exec-11.eqiad     1        
3486786 0.30454 ecmabot-wm tools.ecmabo r     08/27/2014 02:56:23 continuous@tools-exec-04.eqiad     1        
3488164 0.30440 ecmabot-wm tools.ecmabo r     08/27/2014 04:06:23 continuous@tools-exec-13.eqiad     1        
3495944 0.30356 ecmabot-wm tools.ecmabo r     08/27/2014 11:16:23 continuous@tools-exec-06.eqiad     1        
3499118 0.30321 ecmabot-wm tools.ecmabo r     08/27/2014 14:10:26 continuous@tools-exec-13.eqiad     1        
3503921 0.30270 ecmabot-wm tools.ecmabo r     08/27/2014 18:34:27 continuous@tools-exec-04.eqiad     1        
3506230 0.30245 ecmabot-wm tools.ecmabo r     08/27/2014 20:38:27 continuous@tools-exec-06.eqiad     1 


Is there some change we're expected to make in how we invoke it from cron?
Comment 6 Tim Landscheidt 2014-08-29 00:17:12 UTC
(In reply to Krinkle from comment #5)
> [...]

> Is there some change we're expected to make in how we invoke it from cron?

You can use bigbrother as described in http://permalink.gmane.org/gmane.org.wikimedia.labs/2757.  As there is only one bigbrother instance that processes all jobs sequentially, no race conditions can occur.  The interval between checks is 10 seconds when idling.
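
To illustrate why a single sequential supervisor avoids the race: each check and start completes before the next one begins, so two checks for the same job can never overlap. The following is a sketch of that idea only, not bigbrother's actual code; `supervise_pass`, `is_running` and `submit` are hypothetical names.

```python
def supervise_pass(wanted, is_running, submit):
    """One sequential pass over the wanted job names.

    Starts every job that is not currently running and returns the names
    that were started. Because only one process runs this loop, the
    check and the start for a given name are never interleaved with
    another check for the same name.
    """
    started = []
    for name in wanted:
        if not is_running(name):
            submit(name)
            started.append(name)
    return started
```

A real supervisor would repeat such a pass on a timer, for example with the 10 second idle interval mentioned above.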
Comment 7 Krinkle 2014-10-27 20:02:15 UTC
Still happening. ecmabot-wm had two instances running on IRC (and while only 2 were visible on Freenode IRC, it turns out many more were running on the server).

tools.ecmabot@tools-login:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
4411311 0.35894 ecmabot-wm tools.ecmabo r     09/29/2014 21:37:12 continuous@tools-exec-14.eqiad     1        
4485809 0.35300 ecmabot-wm tools.ecmabo r     10/02/2014 17:12:57 continuous@tools-exec-15.eqiad     1        
4485842 0.35299 ecmabot-wm tools.ecmabo r     10/02/2014 17:13:57 continuous@tools-exec-01.eqiad     1        
4485847 0.35299 ecmabot-wm tools.ecmabo r     10/02/2014 17:13:57 continuous@tools-exec-02.eqiad     1        
4485858 0.35299 ecmabot-wm tools.ecmabo Rr    10/07/2014 23:40:42 continuous@tools-exec-01.eqiad     1        
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
tools.ecmabot has registered the job 4411311 for deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
job 4411311 is already in deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
tools.ecmabot has registered the job 4485809 for deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
tools.ecmabot has registered the job 4485842 for deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
job 4485842 is already in deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
tools.ecmabot has registered the job 4485847 for deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
job 4485847 is already in deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
tools.ecmabot has registered the job 4485858 for deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
job 4485858 is already in deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
No job 'ecmabot-wm' is currently queued or running
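
Rather than calling jstop repeatedly, the duplicates could be listed from the qstat output first. A rough Python sketch, assuming the column layout shown above (job ID first, job name third) and assuming the lowest job ID is the instance to keep; both function names are made up for this example.

```python
from collections import defaultdict


def running_jobs_by_name(qstat_output):
    """Group job IDs by job name from plain `qstat` output."""
    jobs = defaultdict(list)
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; this skips the header,
        # the dashed rule and any shell prompt lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        jobs[fields[2]].append(int(fields[0]))
    return dict(jobs)


def duplicates(jobs):
    """Per job name, return every job ID except the lowest one.

    Keeping the lowest ID is an assumption of this sketch, not a rule
    taken from jsub or SGE.
    """
    return {name: sorted(ids)[1:] for name, ids in jobs.items() if len(ids) > 1}
```

Note that qstat appears to truncate columns (the user shows as tools.ecmabo), so long job names may not compare reliably with this approach.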
Comment 8 Krinkle 2014-10-27 20:05:37 UTC
(In reply to Tim Landscheidt from comment #6)
> (In reply to Krinkle from comment #5)
> > [...]
> 
> > Is there some change we're expected to make in how we invoke it from cron?
> 
> You can use bigbrother as described in
> http://permalink.gmane.org/gmane.org.wikimedia.labs/2757.  As there is only
> one bigbrother instance that processes all jobs sequentially, no race
> conditions can occur.  The interval between checks is 10 seconds when idling.

I generally avoid relying on recommendations made on mailing lists because they are frozen in time. There's no way of telling whether a recommendation has since changed, and new users would never find it. If this is really what users should be using instead of jsub, maybe it should be documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
