Last modified: 2013-08-31 03:34:12 UTC
A queue machine has been showing an alarm state via qstat for several days. Jobs cannot be deleted or submitted to the machine.
After investigation: the grid scheduler sent jobs to the new execution node before puppet had finished configuring it, which left the queue in an inconsistent state (the execution node could not properly attribute the jobs to the correct master). Forcing the puppet run to complete and restarting the node put it back in proper working order; any continuous tasks that had been scheduled on it have been restarted automatically.
Something still isn't right. The queue on the machine is no longer reporting an alarm state. However, I'm still unable to delete a current job from the machine, even with the -f option. It is listed as being registered for deletion. The listed load average is also very high.
Could you give me the job number that is giving you issues? I may be able to diagnose it further.
Job #871376. Name of job: dumpmuncher-frwiki
I'm not entirely sure why, but your job was stuck waiting on a bzcat that had hung hard. Killing the bzcat allowed your job to die and the deletion to go through.
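For anyone hitting something similar: a rough sketch of how one might look for a hung child under a job's process tree. It only parses the text output of `ps -eo pid,ppid,stat,comm` and walks down from a given parent PID; the function names, the sample listing, and the PIDs in it are made up for illustration and are not taken from the actual node.

```python
from collections import defaultdict

# Hypothetical sample of `ps -eo pid,ppid,stat,comm` output (invented PIDs).
SAMPLE = """\
  PID  PPID STAT COMMAND
 4100     1 Ss   job_wrapper
 4101  4100 S    bash
 4102  4101 S    bzcat
 4103  4101 S    python
 5000     1 S    sshd
"""


def parse_ps(text):
    """Turn `ps -eo pid,ppid,stat,comm` text into a list of dicts."""
    procs = []
    for line in text.strip().splitlines()[1:]:  # first line is the header
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # skip anything that does not have all four columns
        pid, ppid, stat, comm = parts
        procs.append(
            {"pid": int(pid), "ppid": int(ppid), "stat": stat, "comm": comm.strip()}
        )
    return procs


def descendants(procs, root_pid):
    """Return every process below root_pid (children, grandchildren, ...)."""
    children = defaultdict(list)
    for proc in procs:
        children[proc["ppid"]].append(proc)

    found, stack, seen = [], [root_pid], {root_pid}
    while stack:
        for child in children[stack.pop()]:
            if child["pid"] in seen:
                continue  # guard against odd self-referencing entries
            seen.add(child["pid"])
            found.append(child)
            stack.append(child["pid"])
    return found


# List what is running underneath the (hypothetical) job wrapper, PID 4100.
for proc in descendants(parse_ps(SAMPLE), 4100):
    print(proc["pid"], proc["stat"], proc["comm"])
```

On a live node the same functions could be fed the real `ps` output instead of SAMPLE; which child is actually the stuck one still has to be judged by hand.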