Last modified: 2014-10-17 08:16:21 UTC
18 queue slots in total are either in alarm or error state. One webserver slot is dead. Since new jobs won't start on slots in alarm or error state, new job submissions are piling up and not running. Any miracles?
Commands to show this would be very welcome, for curious non-techies like me.
$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
mailq@tools-exec-01.eqiad.wmfl BP    0/0/5          1.31     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-02.eqiad.wmfl BP    0/0/5          2.95     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-03.eqiad.wmfl BP    0/0/5          9.19     lx26-amd64 aE
---------------------------------------------------------------------------------
mailq@tools-exec-04.eqiad.wmfl BP    0/0/5          3.62     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-05.eqiad.wmfl BP    0/0/5          3.07     lx26-amd64
---------------------------------------------------------------------------------
mailq@tools-exec-06.eqiad.wmfl BP    0/0/5          1.73     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-07.eqiad.wmfl BP    0/0/5          1.17     lx26-amd64
---------------------------------------------------------------------------------
mailq@tools-exec-08.eqiad.wmfl BP    0/0/5          1.29     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-09.eqiad.wmfl BP    0/0/5          1.14     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-10.eqiad.wmfl BP    0/0/5          0.57     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-01.eqiad.wmfla BIP   0/0/50         1.31     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-02.eqiad.wmfla BIP   0/12/50        2.95     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-03.eqiad.wmfla BIP   0/3/50         9.19     lx26-amd64 aE
---------------------------------------------------------------------------------
task@tools-exec-04.eqiad.wmfla BIP   0/5/50         3.62     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-05.eqiad.wmfla BIP   0/7/50         3.07     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-06.eqiad.wmfla BIP   0/4/50         1.73     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-07.eqiad.wmfla BIP   0/25/50        1.17     lx26-amd64
---------------------------------------------------------------------------------
task@tools-exec-08.eqiad.wmfla BIP   0/5/50         1.29     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-09.eqiad.wmfla BIP   0/3/50         1.14     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-10.eqiad.wmfla BIP   0/5/50         0.57     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-01.eqiad BC    0/10/50        1.31     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-02.eqiad BC    0/13/50        2.95     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-03.eqiad BC    0/19/50        9.19     lx26-amd64 a
---------------------------------------------------------------------------------
continuous@tools-exec-04.eqiad BC    0/17/50        3.62     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-05.eqiad BC    0/10/50        3.07     lx26-amd64
---------------------------------------------------------------------------------
continuous@tools-exec-06.eqiad BC    0/23/50        1.73     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-07.eqiad BC    0/6/50         1.17     lx26-amd64
---------------------------------------------------------------------------------
continuous@tools-exec-08.eqiad BC    0/19/50        1.29     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-09.eqiad BC    0/15/50        1.14     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-10.eqiad BC    0/11/50        0.57     lx26-amd64 E
---------------------------------------------------------------------------------
webgrid-lighttpd@tools-webgrid B     0/249/256      6.86     lx26-amd64 E
---------------------------------------------------------------------------------
webgrid-lighttpd@tools-webgrid B     0/76/256       1.26     lx26-amd64 E
---------------------------------------------------------------------------------
webgrid-lighttpd@tools-webgrid B     0/26/256       0.31     lx26-amd64
---------------------------------------------------------------------------------
webgrid-lighttpd@tools-webgrid B     0/0/256        -NA-     -NA-       au
---------------------------------------------------------------------------------
webgrid-tomcat@tools-webgrid-t B     0/5/32         0.92     lx26-amd64
---------------------------------------------------------------------------------
cyberbot@tools-exec-cyberbot.e BC    0/15/1000      2.37     lx26-amd64
---------------------------------------------------------------------------------
giftbot@tools-exec-gift.eqiad. BC    0/1/1000       0.04     lx26-amd64
---------------------------------------------------------------------------------
wmt@tools-exec-wmt.eqiad.wmfla BC    0/14/1000      0.09     lx26-amd64

(Getting the actual, untruncated hostname is probably only possible by passing -xml.)

E = 'Error', a = 'alarm'
And, from man qstat:

    If the state is a(larm) at least one of the load thresholds defined in the load_thresholds list of the queue configuration (see queue_conf(5)) is currently exceeded, which prevents from scheduling further jobs to that queue. As opposed to this, the state A(larm) indicates that at least one of the suspend thresholds of the queue (see queue_conf(5)) is currently exceeded. This will result in jobs running in that queue being successively suspended until no threshold is violated.

    The states s(uspended) and d(isabled) can be assigned to queues and released via the qmod(1) command. Suspending a queue will cause all jobs executing in that queue to be suspended. (...)

    If an E(rror) state is displayed for a queue, sge_execd(8) on that host was unable to locate the sge_shepherd(8) executable on that host in order to start a job. Please check the error logfile of that sge_execd(8) for leads on how to resolve the problem. Please enable the queue afterwards via the -c option of the qmod(1) command manually.
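Following the man page, a sketch of the recovery steps an admin would run on the grid master (the queue instance name below is illustrative, not taken from this report):

```shell
# Show, per queue instance, why it is flagged E (standard qstat flag)
qstat -f -explain E

# Clear the Error state of a single queue instance, e.g.:
qmod -c 'mailq@tools-exec-01.eqiad.wmflabs'

# Or clear the Error state of every queue in one go:
qmod -c '*'
```

Note that clearing E only re-enables scheduling; if the underlying cause (see the sge_execd log) isn't fixed, the queue will drop back into Error on the next job start.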
Anyone willing to work on this? YuviPanda? Btw, you can list queues together with their full hostnames with: $ qhost -q
I've reset all queues, but they become stuck again because /var is full on the hosts. Assigning to Yuvi for diamond & biglog.
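For reference, confirming that /var is full on an exec host only takes two commands (the paths are the usual suspects, not taken from this report):

```shell
# How full is the filesystem holding /var?
df -h /var

# Which directories under /var are the biggest? (largest printed last;
# -x stays on one filesystem, stderr silenced for unreadable dirs)
du -xsh /var/* 2>/dev/null | sort -h | tail -5
```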
They seem ok now, and diamond no longer fills up /var...