Last modified: 2014-11-20 16:36:04 UTC
From James' email to the QA list: Beta Labs isn't synchronising; AFAICS it hasn't done so since ~ 11 hours ago (15:10 UTC on 2014-09-08). I noticed this when prepping a patch for tomorrow and found that. Going to https://integration.wikimedia.org/ci/view/Beta/ I found that "beta-update-databases-eqiad" had been executing for 12 hours, and initially assumed that we had a run-away update.php issue again. However, on examining it looks like "deployment-bastion.eqiad", or the jenkins executor on it, isn't responding in some way: pending—Waiting for next available executor on deployment-bastion.eqiad I terminated the beta-update-databases-eqiad run to see if that would help, but it just switched over to beta-scap-eqiad being the pending task. Having chatted with MaxSem, I briefly disabled in the jenkins interface the deployment-bastion.eqiad node and then re-enabled it, to no effect. Any ideas?
This happens once in a while. It's some sort of deadlock in Jenkins itself. Here's how I generally try to resolve it: * Take deployment-bastion offline in Jenkins <https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/markOffline> * Kill any jenkins jobs running on deployment-bastion via Jenkins UI * Kill all pending jobs in the Jenkins queue that are "waiting on executors" * Disconnect deployment-bastion <https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/disconnect> * Bring deployment-bastion back online (button labeled "Bring this node back online") * Launch slave agent (there's a button that says this) * Check agent log to see that it connected <https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/log> Sometimes you have to do this whole dance twice before Jenkins realizes that the there are a bunch of executors that it can use. This deadlock seems to happen more often than not following or during a database update that is taking a while to complete.
Thanks! Should we write these down somewhere for the next time it occurs?
(In reply to James Forrester from comment #2) > Thanks! Should we write these down somewhere for the next time it occurs? Greg was way ahead of you -- https://www.mediawiki.org/w/index.php?title=Continuous_integration/Jenkins&diff=1144517&oldid=1091321
Ha. Thanks, both!
Meta: one of the themes for RelEng coming up is improving Jenkins performance, this list of issues is a good place to start.
This happened again today. It seems like it always involves the database update job. Maybe related to https://issues.jenkins-ci.org/browse/JENKINS-10944. Apparently the parent build for a matrix job is not supposed to occupy an executor, but sometimes it does. Seems to be an open bug still upstream. I have manually changed the config for the beta-update-databases-eqiad job in Jenkins to use the "Throttle Concurrent Builds" in an attempt to keep Jenkins from confusing itself. If this "works" then the changes should be backported into the JJB configuration to keep it from coming back the next time the job is updated from config. * Maximum Total Concurrent Builds: 2 * Maximum Concurrent Builds Per Node: 2 * Throttle Matrix master builds: true * Throttle Matrix configuration builds: true (reopening to keep this on my radar for a few days)
Sept 23rd: 22:51 bd808: Jenkins stuck trying to update database in beta again with the dumb "waiting for executors" bug/problem (just keeping a log of when it occurs)
(In reply to Bryan Davis from comment #6) > > I have manually changed the config for the beta-update-databases-eqiad job > in Jenkins to use the "Throttle Concurrent Builds" in an attempt to keep > Jenkins from confusing itself. If this "works" then the changes should be > backported into the JJB configuration to keep it from coming back the next > time the job is updated from config. > > * Maximum Total Concurrent Builds: 2 > * Maximum Concurrent Builds Per Node: 2 > * Throttle Matrix master builds: true > * Throttle Matrix configuration builds: true > > (reopening to keep this on my radar for a few days) The throttle settings are still in place, so they didn't fix the problem. The fact that we didn't see this error for a couple of weeks is apparently uncorrelated.
That happened again on Oct 23 2014. Looking at Jenkins thread dumps, the deployment-bastion execute thread are locked by the Gearman plugin. It must have some logic error somewhere but I can't really debug Java ;(
Un-licking this cookie. I took a shot at it but the problem seems to be deeper in the stack.
I forgot to update this bug after my debugging session on Oct 24th here is the rough brain dump: The Gearman plugin source code is at https://review.openstack.org/openstack-infra/gearman-plugin.git I have added a logger for the Gearman plugin ( hudson.plugins.gearman.logger @ INFO ): https://integration.wikimedia.org/ci/log/Plugins%20-%20Gearman/ Whenever the issue occurs we can switch it to FINE and get some debug messages. IIRC the bastion executors threads were held in lock() within src/main/java/hudson/plugins/gearman/NodeAvailabilityMonitor.java lock() function. It takes a worker as parameter and has a wait(5000). I think the debug message had a 'null' worker possibly ending in a dead end of being always considered busy and thus waiting. That might be a bad interaction with the jobs that are scheduled by Jenkins itself such as the hourly database update that runs on deployment bastion as well. And of course: I have no idea how to reproduce it nor how to hook a debugger in Jenkins nor how to dump the state of variables :-/
Created attachment 17012 [details] Raw thread dump from Jenkins deployment-bastion.eqiad node is in a deadlock right now, I took a thread dump from Jenkins using https://integration.wikimedia.org/ci/monitoring?part=threadsDump The Jenkins Gearman plugin seems to lock the node executors, with threads looking like: "Gearman worker deployment-bastion.eqiad_exec-0" daemon prio=5 TIMED_WAITING java.lang.Object.wait(Native Method) hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83) hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:388) hudson.plugins.gearman.MyGearmanWorkerImpl.handleSessionEvent(MyGearmanWorkerImpl.java:422) hudson.plugins.gearman.MyGearmanWorkerImpl$GrabJobEventHandler.handleEvent(MyGearmanWorkerImpl.java:99) org.gearman.common.GearmanTask.handleGearmanIOEvent(GearmanTask.java:92) org.gearman.common.GearmanJobServerSession.handleResSessionEvent(GearmanJobServerSession.java:329) org.gearman.common.GearmanJobServerSession.handleSessionEvent(GearmanJobServerSession.java:240) org.gearman.common.GearmanJobServerSession.driveSessionIO(GearmanJobServerSession.java:203) hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:357) hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166) java.lang.Thread.run(Thread.java:701)
From Jenkins log https://integration.wikimedia.org/ci/log/Plugins%20-%20Gearman/ ( hudson.plugins.gearman.logger @ FINEST ) there is a spam of: AvailabilityMonitor canTake request for deployment-bastion.eqiad_exec-1 Nov 04, 2014 2:15:19 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake AvailabilityMonitor canTake request for deployment-bastion.eqiad_exec-1 Nov 04, 2014 2:15:19 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
The rest of the executors are hold in lock with: "Gearman worker deployment-bastion.eqiad_exec-2" daemon prio=5 TIMED_WAITING java.lang.Object.wait(Native Method) hudson.plugins.gearman .NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83) .MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:388) .MyGearmanWorkerImpl.handleSessionEvent(MyGearmanWorkerImpl.java:422)
I disconnected the Jenkins Gearman client which is holding the lock. That unleashed the jobs that are meant to be run on deployment-bastion.eqiad.
I am moving this bug to Continous integration since its root cause is somewhere in Jenkins Gearman plugin. Will make it easier for me to find that bug again.
Most executors are held in lock as mentioned on comment #14. I noticed one thread having a different stacktrace: "Gearman worker deployment-bastion.eqiad_exec-1" daemon prio=5 WAITING java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:502) hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73) hudson.plugins.gearman .StartJobWorker.safeExecuteFunction(StartJobWorker.java:196) hudson.plugins.gearman .StartJobWorker.executeFunction(StartJobWorker.java:114) org.gearman.worker .AbstractGearmanFunction.call(AbstractGearmanFunction.java:125) org.gearman.worker .AbstractGearmanFunction.call(AbstractGearmanFunction.java:22) hudson.plugins.gearman.MyGearmanWorkerImpl .submitFunction(MyGearmanWorkerImpl.java:590) hudson.plugins.gearman.MyGearmanWorkerImpl .work(MyGearmanWorkerImpl.java:374) hudson.plugins.gearman.AbstractWorkerThread .run(AbstractWorkerThread.java:166) java.lang.Thread.run(Thread.java:701)
This need to reported to upstream https://bugs.launchpad.net/gearman-plugin/
*** Bug 73659 has been marked as a duplicate of this bug. ***