Last modified: 2013-08-31 03:34:12 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T55606, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 53606 - queue machine down
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Marc A. Pelletier
Reported: 2013-08-30 22:29 UTC by bgwhite
Modified: 2013-08-31 03:34 UTC (History)

Description bgwhite 2013-08-30 22:29:42 UTC
A queue machine has been showing an alarm state via qstat for several days. Jobs cannot be deleted or submitted to the machine.
Comment 1 Marc A. Pelletier 2013-08-30 23:54:10 UTC
After investigation, what happened is that the grid scheduler sent jobs to the new execution node before puppet had finished configuring it fully, placing the queue in an inconsistent state (the execution node could not properly attribute the jobs to the correct master).

Forcing completion of the puppet run and restarting the node put it back in proper working order; any continuous tasks that had been scheduled on it have been restarted automatically.
Comment 2 bgwhite 2013-08-31 02:36:42 UTC
Something still isn't right.  The queue on the machine isn't reporting an alert state anymore.  However, I'm still unable to delete a current job from the machine, even with the -f option.  It is listed as being registered for deletion.  Load average listed is also very high.
Comment 3 Marc A. Pelletier 2013-08-31 02:44:21 UTC
Could you give me the job number that is giving you issues?  I may be able to diagnose it further.
Comment 4 bgwhite 2013-08-31 02:48:08 UTC
Job #871376 
Name of job: dumpmuncher-frwiki
Comment 5 Marc A. Pelletier 2013-08-31 03:34:12 UTC
I'm not entirely sure why, but your job was stuck waiting on a bzcat that had hung hard.  Killing it (the bzcat) allowed your job to die and be canceled.
