Last modified: 2014-10-17 08:16:21 UTC
18 queue slots in total are either in alarm or error state. One webserver slot is dead. Since new jobs won't start on slots in alarm or error state, new job submissions are piling up and not running. Any miracles?
Commands to show this would be very welcome, for curious non-techies like me.
$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
mailq@tools-exec-01.eqiad.wmfl BP    0/0/5          1.31     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-02.eqiad.wmfl BP    0/0/5          2.95     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-03.eqiad.wmfl BP    0/0/5          9.19     lx26-amd64 aE
---------------------------------------------------------------------------------
mailq@tools-exec-04.eqiad.wmfl BP    0/0/5          3.62     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-05.eqiad.wmfl BP    0/0/5          3.07     lx26-amd64
---------------------------------------------------------------------------------
mailq@tools-exec-06.eqiad.wmfl BP    0/0/5          1.73     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-07.eqiad.wmfl BP    0/0/5          1.17     lx26-amd64
---------------------------------------------------------------------------------
mailq@tools-exec-08.eqiad.wmfl BP    0/0/5          1.29     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-09.eqiad.wmfl BP    0/0/5          1.14     lx26-amd64 E
---------------------------------------------------------------------------------
mailq@tools-exec-10.eqiad.wmfl BP    0/0/5          0.57     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-01.eqiad.wmfla BIP   0/0/50         1.31     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-02.eqiad.wmfla BIP   0/12/50        2.95     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-03.eqiad.wmfla BIP   0/3/50         9.19     lx26-amd64 aE
---------------------------------------------------------------------------------
task@tools-exec-04.eqiad.wmfla BIP   0/5/50         3.62     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-05.eqiad.wmfla BIP   0/7/50         3.07     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-06.eqiad.wmfla BIP   0/4/50         1.73     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-07.eqiad.wmfla BIP   0/25/50        1.17     lx26-amd64
---------------------------------------------------------------------------------
task@tools-exec-08.eqiad.wmfla BIP   0/5/50         1.29     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-09.eqiad.wmfla BIP   0/3/50         1.14     lx26-amd64 E
---------------------------------------------------------------------------------
task@tools-exec-10.eqiad.wmfla BIP   0/5/50         0.57     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-01.eqiad BC    0/10/50        1.31     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-02.eqiad BC    0/13/50        2.95     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-03.eqiad BC    0/19/50        9.19     lx26-amd64 a
---------------------------------------------------------------------------------
continuous@tools-exec-04.eqiad BC    0/17/50        3.62     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-05.eqiad BC    0/10/50        3.07     lx26-amd64
---------------------------------------------------------------------------------
continuous@tools-exec-06.eqiad BC    0/23/50        1.73     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-07.eqiad BC    0/6/50         1.17     lx26-amd64
---------------------------------------------------------------------------------
continuous@tools-exec-08.eqiad BC    0/19/50        1.29     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-09.eqiad BC    0/15/50        1.14     lx26-amd64 E
---------------------------------------------------------------------------------
continuous@tools-exec-10.eqiad BC    0/11/50        0.57     lx26-amd64 E
---------------------------------------------------------------------------------
webgrid-lighttpd@tools-webgrid B     0/249/256      6.86     lx26-amd64 E
---------------------------------------------------------------------------------
webgrid-lighttpd@tools-webgrid B     0/76/256       1.26     lx26-amd64 E
---------------------------------------------------------------------------------
webgrid-lighttpd@tools-webgrid B     0/26/256       0.31     lx26-amd64
---------------------------------------------------------------------------------
webgrid-lighttpd@tools-webgrid B     0/0/256        -NA-     -NA-       au
---------------------------------------------------------------------------------
webgrid-tomcat@tools-webgrid-t B     0/5/32         0.92     lx26-amd64
---------------------------------------------------------------------------------
cyberbot@tools-exec-cyberbot.e BC    0/15/1000      2.37     lx26-amd64
---------------------------------------------------------------------------------
giftbot@tools-exec-gift.eqiad. BC    0/1/1000       0.04     lx26-amd64
---------------------------------------------------------------------------------
wmt@tools-exec-wmt.eqiad.wmfla BC    0/14/1000      0.09     lx26-amd64

(Getting the actual, untruncated hostname is probably only possible by passing -xml.)

E = 'Error', a = 'alarm'
And, from man qstat:

    If the state is a(larm) at least one of the load thresholds defined in the load_thresholds list of the queue configuration (see queue_conf(5)) is currently exceeded, which prevents from scheduling further jobs to that queue. As opposed to this, the state A(larm) indicates that at least one of the suspend thresholds of the queue (see queue_conf(5)) is currently exceeded. This will result in jobs running in that queue being successively suspended until no threshold is violated.

    The states s(uspended) and d(isabled) can be assigned to queues and released via the qmod(1) command. Suspending a queue will cause all jobs executing in that queue to be suspended. (...)

    If an E(rror) state is displayed for a queue, sge_execd(8) on that host was unable to locate the sge_shepherd(8) executable on that host in order to start a job. Please check the error logfile of that sge_execd(8) for leads on how to resolve the problem. Please enable the queue afterwards via the -c option of the qmod(1) command manually.
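Following the man page, a sketch of the recovery steps an admin would run on the grid master (the queue instance name below is illustrative, not taken from this report):

```shell
# Show, per queue instance, why it is flagged E (standard qstat flag)
qstat -f -explain E

# Clear the Error state of a single queue instance, e.g.:
qmod -c 'mailq@tools-exec-01.eqiad.wmflabs'

# Or clear the Error state of every queue in one go:
qmod -c '*'
```

Note that clearing E only re-enables scheduling; if the underlying cause (see the sge_execd log) isn't fixed, the queue will drop back into Error on the next job start.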
Anyone willing to work on this? YuviPanda? Btw, you can list queues together with their full hostnames with: $ qhost -q
I've reset all queues, but they become stuck again because /var is full on the hosts. Assigning to Yuvi for diamond & biglog.
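For reference, confirming that /var is full on an exec host only takes two commands (the paths are the usual suspects, not taken from this report):

```shell
# How full is the filesystem holding /var?
df -h /var

# Which directories under /var are the biggest? (largest printed last;
# -x stays on one filesystem, stderr silenced for unreadable dirs)
du -xsh /var/* 2>/dev/null | sort -h | tail -5
```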
They seem ok now, and diamond no longer fills up /var...