Last modified: 2014-02-05 05:35:59 UTC
I got this in qstat:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 801291 0.32618 php_dispat local-liange Rr    08/28/2013 02:00:02 continuous@tools-exec-05.pmtpa     1
 869600 0.26803 php_dispat local-liange r     08/27/2013 19:00:17 continuous@tools-exec-01.pmtpa     1

while having a jstart call in crontab. I guess it's because jstart didn't see that Rr task and started a new one.

Category  State               SGE Letter Code
Running   running             r
Running   running, re-submit  Rr
I can't reproduce that:

| scfc@tools-login:~$ echo sleep 10m > sleep-test.sh && chmod +x sleep-test.sh
| scfc@tools-login:~$ jstart -N sleep-test ./sleep-test.sh
| Your job 2415536 ("sleep-test") has been submitted
| scfc@tools-login:~$ qstat
| job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
| -----------------------------------------------------------------------------------------------------------------
| 2415536 0.25000 sleep-test scfc         r     02/03/2014 03:03:38 continuous@tools-exec-06.pmtpa     1
| scfc@tools-login:~$ qmod -rj 2415536
| Pushed rescheduling of job 2415536 on host tools-exec-06.pmtpa.wmflabs
| scfc@tools-login:~$ qstat
| job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
| -----------------------------------------------------------------------------------------------------------------
| 2415536 0.25000 sleep-test scfc         Rr    02/03/2014 03:04:38 continuous@tools-exec-03.pmtpa     1
| scfc@tools-login:~$ jstart -N sleep-test ./sleep-test.sh
| scfc@tools-login:~$ qstat
| job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
| -----------------------------------------------------------------------------------------------------------------
| 2415536 0.25000 sleep-test scfc         Rr    02/03/2014 03:04:38 continuous@tools-exec-03.pmtpa     1
| scfc@tools-login:~$
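For anyone who wants to check by hand whether a job counts as running in either state, here is a rough sketch of a filter over plain qstat output. It is not how jstart does its check; the function name running_job_ids is made up for illustration, and it assumes the column layout shown above (state in the fifth column, name truncated to 10 characters as with "php_dispat").

```shell
#!/bin/bash
# Hypothetical helper, NOT part of jstart: print the IDs of jobs with the
# given name whose state is "r" or "Rr", reading qstat output on stdin.
# Assumes the plain qstat layout quoted in this report: job-ID in column 1,
# name in column 3 (cut to 10 characters), state in column 5.
running_job_ids() {
    awk -v name="$1" '
        # Skip the header and separator lines: only rows with a numeric job-ID.
        $1 ~ /^[0-9]+$/ && $3 == substr(name, 1, 10) && $5 ~ /^R?r$/ { print $1 }
    '
}
```

Usage would be along the lines of "qstat | running_job_ids sleep-test"; an empty result means no job with that name is in state r or Rr.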
So is there any other possible cause for the original issue?
There is always the possibility of a race condition: there is no locking, so if you jstart twice within a very short period of time (a few seconds), both invocations would see no job running and each would start one. But that seems unlikely if you start with cron, unless the interval is fairly short and tools-login was *really* loaded.
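To make the race concrete, here is a toy model of the "check, then start" pattern. It is only an illustration under stated assumptions, not jstart's actual code: a temporary file stands in for the scheduler's job list, and job_is_running, submit_job, start_if_absent and RACE_DELAY are all invented names.

```shell
#!/bin/bash
# Toy model only: a text file plays the role of the scheduler's job list.
JOBLIST=${JOBLIST:-$(mktemp)}

# Stand-in for "ask the scheduler whether a job with this name exists".
job_is_running() { grep -qxF -- "$1" "$JOBLIST"; }

# Stand-in for "submit the job".
submit_job() { echo "$1" >> "$JOBLIST"; }

# Check-then-act without any locking.
start_if_absent() {
    if ! job_is_running "$1"; then
        # Race window: a second invocation that runs its check before we
        # reach submit_job also sees "not running". RACE_DELAY (seconds)
        # widens the window artificially; on a real host, load would do it.
        sleep "${RACE_DELAY:-0}"
        submit_job "$1"
    fi
}
```

With RACE_DELAY=1, running "start_if_absent demo &" twice and then "wait" leaves two "demo" entries in the file, while two sequential calls leave only one.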
(In reply to comment #3)
> There is always the possibility of a race condition: there is no locking, so
> if you jstart twice within a very short period of time (a few seconds), both
> invocations would see no job running and each would start one. But that seems
> unlikely if you start with cron, unless the interval is fairly short and
> tools-login was *really* loaded.

That cron entry is "0/10 * * * * $HOME/mw/startLabsDispatchRC.sh". Is the interval too short?

Also, do you think a bug report about the lack of locking would be worthwhile (that is, would it not just be WONTFIXed)?
10 minutes seems long enough that I'm really surprised this could have happened at all; I might have expected it to happen in the 1-2 minute range at most. Locking would be a reasonable added safeguard, and even when cron gets replaced it would remain useful, but it has a few implementation gotchas that will be tricky to get right. Nevertheless, having a bug for it would not be a bad thing.
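As a stopgap on the user side, the cron script itself could serialize its "check, then start" step. The following is a minimal sketch using flock(1) from util-linux; it is not a proposal for how jstart should implement locking, and the function name with_lock and the exit code 75 are arbitrary choices for this example.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command only if no other invocation currently
# holds the lock file; otherwise skip instead of waiting.
with_lock() {
    local lockfile=$1
    shift
    (
        # Take an exclusive, non-blocking lock on file descriptor 9.
        # If another process holds it, give up with a distinctive status.
        flock -n 9 || exit 75
        "$@"
    ) 9>"$lockfile"
    # The lock is released when the subshell exits and fd 9 is closed
    # (a long-lived child that inherits fd 9 would keep it held).
}
```

A call such as with_lock "$HOME/.dispatch.lock" "$HOME/mw/startLabsDispatchRC.sh" (the lock path is made up) would then skip a run while a previous one is still in its critical section. One of the gotchas: a lock like this only covers invocations that share the same lock file, and flock behaviour on network file systems depends on the setup.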
So bug 60862.