Last modified: 2014-10-27 20:05:37 UTC
In bug 53629, Coren mentioned: "there is no locking, so if you jstart twice within a very short period of time (a few seconds) both invocations would see none running and start"
"Locking would be a reasonable added safeguard, and even when cron gets replaced it would remain useful, but has a few implementation gotchas that will be tricky to get right. Nevertheless, having a bug for it would not be a bad thing."
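The race described here is a check-then-start sequence with nothing serializing it. Below is a minimal sketch of the locking idea, assuming a per-job lock file. `start_once`, `is_running`, `start` and `lock_dir` are hypothetical names used for illustration only; this is not how jstart is implemented.

```python
import fcntl
import os
import tempfile


def start_once(job_name, is_running, start, lock_dir=None):
    """Run the check-then-start sequence under an exclusive lock.

    is_running and start are hypothetical callbacks standing in for the
    real "is the job queued or running?" check and the real submission.
    Returns True if this call started the job, False otherwise.
    """
    lock_dir = lock_dir or tempfile.gettempdir()
    lock_path = os.path.join(lock_dir, job_name + ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until no other invocation holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if is_running(job_name):
                return False
            start(job_name)
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

An advisory lock like this only serializes invocations that share the lock file on a filesystem where `flock` behaves as expected. That may not hold across hosts or on network filesystems, which is probably one of the implementation gotchas mentioned above.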
*** Bug 69867 has been marked as a duplicate of this bug. ***
Hm.. if bug 69867 is a dupe of this, how come it was jstarted twice in such a short period of time? It runs every 2 or 5 minutes. Surely 'jstart' doesn't take that long to run?
I assume (= my reasoning for flagging bug #69867 as a dupe) that network congestion and/or load on the client, the SGE server or the DNS server can cause jstart to take that "long" to run (or to fail at the "check if job is running" phase).
Happened again at tools.wmfdbbbot:
qstat:
3343394 0.31914 dbbot-wm tools.wmfdbb r 08/21/2014 18:40:18 continuous@tools-exec-13.eqiad 1
3488693 0.30383 dbbot-wm tools.wmfdbb r 08/27/2014 04:35:23 continuous@tools-exec-10.eqiad 1
3505161 0.30205 dbbot-wm tools.wmfdbb r 08/27/2014 19:40:27 continuous@tools-exec-06.eqiad 1
crontab:
*/5 * * * * /usr/bin/jsub -N dbbot-wm -once -continuous -quiet -mem 1280M -o /dev/null php ~/apps/ts-krinkle-Kribo/Init.php
And tools.ecmabot:
3343333 0.31965 ecmabot-wm tools.ecmabo r 08/21/2014 18:36:18 continuous@tools-exec-09.eqiad 1
3371234 0.31670 ecmabot-wm tools.ecmabo r 08/22/2014 19:36:27 continuous@tools-exec-08.eqiad 1
3376532 0.31614 ecmabot-wm tools.ecmabo r 08/23/2014 00:22:27 continuous@tools-exec-10.eqiad 1
3383752 0.31536 ecmabot-wm tools.ecmabo r 08/23/2014 07:00:27 continuous@tools-exec-13.eqiad 1
3450460 0.30830 ecmabot-wm tools.ecmabo r 08/25/2014 18:56:16 continuous@tools-exec-10.eqiad 1
3467240 0.30656 ecmabot-wm tools.ecmabo r 08/26/2014 09:44:23 continuous@tools-exec-11.eqiad 1
3477929 0.30546 ecmabot-wm tools.ecmabo r 08/26/2014 19:06:23 continuous@tools-exec-01.eqiad 1
3484066 0.30482 ecmabot-wm tools.ecmabo r 08/27/2014 00:34:23 continuous@tools-exec-11.eqiad 1
3484971 0.30472 ecmabot-wm tools.ecmabo r 08/27/2014 01:20:23 continuous@tools-exec-11.eqiad 1
3486786 0.30454 ecmabot-wm tools.ecmabo r 08/27/2014 02:56:23 continuous@tools-exec-04.eqiad 1
3488164 0.30440 ecmabot-wm tools.ecmabo r 08/27/2014 04:06:23 continuous@tools-exec-13.eqiad 1
3495944 0.30356 ecmabot-wm tools.ecmabo r 08/27/2014 11:16:23 continuous@tools-exec-06.eqiad 1
3499118 0.30321 ecmabot-wm tools.ecmabo r 08/27/2014 14:10:26 continuous@tools-exec-13.eqiad 1
3503921 0.30270 ecmabot-wm tools.ecmabo r 08/27/2014 18:34:27 continuous@tools-exec-04.eqiad 1
3506230 0.30245 ecmabot-wm tools.ecmabo r 08/27/2014 20:38:27 continuous@tools-exec-06.eqiad 1
Is there some change we're expected to make in how we invoke it from cron?
(In reply to Krinkle from comment #5)
> [...]
> Is there some change we're expected to make in how we invoke it from cron?
You can use bigbrother as described in http://permalink.gmane.org/gmane.org.wikimedia.labs/2757. As there is only one bigbrother instance that processes all jobs sequentially, no race conditions can occur. The interval between checks is 10 seconds when idling.
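To illustrate why a single sequential watcher sidesteps the race, here is a minimal sketch. It is not bigbrother's actual code: `watch`, `is_running`, `start` and the `rounds` and `sleep` parameters are hypothetical names, and the sketch assumes that a started job is visible to the next check.

```python
import time


def watch(jobs, is_running, start, interval=10, rounds=None, sleep=time.sleep):
    """Check each job in turn from a single loop and start missing ones.

    One process performs every check and every start, so two starts of the
    same job cannot interleave. is_running and start are hypothetical
    callbacks; rounds=None loops forever.
    """
    done = 0
    while rounds is None or done < rounds:
        for name in jobs:
            if not is_running(name):
                start(name)
        done += 1
        # Wait between passes; 10 seconds matches the idle interval quoted above.
        sleep(interval)
```

The guarantee depends on there being exactly one such loop. Running a second watcher, or mixing it with cron entries that also submit the same job, would bring the race back.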
Still happening. ecmabot-wm had two instances running on IRC (and while only 2 were visible on Freenode IRC, it turns out many more were running on the server).
qtools.ecmabot@tools-login:~$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
4411311 0.35894 ecmabot-wm tools.ecmabo r 09/29/2014 21:37:12 continuous@tools-exec-14.eqiad 1
4485809 0.35300 ecmabot-wm tools.ecmabo r 10/02/2014 17:12:57 continuous@tools-exec-15.eqiad 1
4485842 0.35299 ecmabot-wm tools.ecmabo r 10/02/2014 17:13:57 continuous@tools-exec-01.eqiad 1
4485847 0.35299 ecmabot-wm tools.ecmabo r 10/02/2014 17:13:57 continuous@tools-exec-02.eqiad 1
4485858 0.35299 ecmabot-wm tools.ecmabo Rr 10/07/2014 23:40:42 continuous@tools-exec-01.eqiad 1
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
tools.ecmabot has registered the job 4411311 for deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
job 4411311 is already in deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
tools.ecmabot has registered the job 4485809 for deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
tools.ecmabot has registered the job 4485842 for deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
job 4485842 is already in deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
tools.ecmabot has registered the job 4485847 for deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
job 4485847 is already in deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
tools.ecmabot has registered the job 4485858 for deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
job 4485858 is already in deletion
tools.ecmabot@tools-login:~$ jstop ecmabot-wm
No job 'ecmabot-wm' is currently queued or running
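Until locking exists, duplicates like these have to be found by hand. A rough sketch of how they could be detected from qstat output follows, assuming the column layout shown above (job-ID, prior, name, user, state, date, time). `find_duplicates` is a hypothetical helper, and treating the oldest job as the one to keep is an arbitrary choice made for this example.

```python
from collections import defaultdict
from datetime import datetime


def find_duplicates(qstat_output):
    """Group qstat rows by job name and return {name: [extra job IDs]}.

    The oldest job per name is treated as the one to keep; every other
    job with the same name is reported as a duplicate.
    """
    jobs = defaultdict(list)
    for line in qstat_output.splitlines():
        fields = line.split()
        # Skip the header, the separator and anything that is not a job row.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        job_id, name = int(fields[0]), fields[2]
        started = datetime.strptime(
            fields[5] + " " + fields[6], "%m/%d/%Y %H:%M:%S"
        )
        jobs[name].append((started, job_id))
    duplicates = {}
    for name, entries in jobs.items():
        entries.sort()  # oldest first, ties broken by job ID
        if len(entries) > 1:
            duplicates[name] = [job_id for _, job_id in entries[1:]]
    return duplicates
```

The sketch only reports job IDs and stops nothing. Note also that qstat appears to truncate the name and user columns in this layout, so two different long job names could collide in the grouping.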
(In reply to Tim Landscheidt from comment #6)
> (In reply to Krinkle from comment #5)
> > [...]
> > Is there some change we're expected to make in how we invoke it from cron?
>
> You can use bigbrother as described in
> http://permalink.gmane.org/gmane.org.wikimedia.labs/2757. As there is only
> one bigbrother instance that processes all jobs sequentially, no race
> conditions can occur. The interval between checks is 10 seconds when idling.
I generally avoid relying on things recommended on mailing lists because they're frozen in time. There's no way of telling whether that recommendation has since changed, and new users would never come across it. If this is really what users should be using instead of jsub, maybe it should be documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help