Last modified: 2014-11-20 16:36:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T72597, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 70597 - Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung)
Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung)
Status: REOPENED
Product: Wikimedia
Classification: Unclassified
Continuous integration (Other open bugs)
unspecified
All All
: Normal normal (vote)
: ---
Assigned To: Nobody - You can work on this!
:
: 73659 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-09-09 02:48 UTC by Greg Grossmeier
Modified: 2014-11-20 16:36 UTC (History)
15 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Raw thread dump from Jenkins (135.20 KB, text/plain)
2014-11-04 14:22 UTC, Antoine "hashar" Musso (WMF)
Details

Description Greg Grossmeier 2014-09-09 02:48:02 UTC
From James' email to the QA list:

Beta Labs isn't synchronising; AFAICS it hasn't done so since ~ 11 hours
ago (15:10 UTC on 2014-09-08). I noticed this when prepping a patch for
tomorrow and found that.

Going to https://integration.wikimedia.org/ci/view/Beta/ I found that
"beta-update-databases-eqiad" had been executing for 12 hours, and
initially assumed that we had a run-away update.php issue again. However,
on examining it looks like "deployment-bastion.eqiad", or the jenkins
executor on it, isn't responding in some way:

pending—Waiting for next available executor on deployment-bastion.eqiad


I terminated the beta-update-databases-eqiad run to see if that would help,
but it just switched over to beta-scap-eqiad being the pending task.

Having chatted with MaxSem, I briefly disabled in the jenkins interface the
deployment-bastion.eqiad node and then re-enabled it, to no effect.

Any ideas?
Comment 1 Bryan Davis 2014-09-09 03:04:55 UTC
This happens once in a while. It's some sort of deadlock in Jenkins itself. Here's how I generally try to resolve it:

* Take deployment-bastion offline in Jenkins <https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/markOffline>
* Kill any jenkins jobs running on deployment-bastion via Jenkins UI
* Kill all pending jobs in the Jenkins queue that are "waiting on executors"
* Disconnect deployment-bastion <https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/disconnect>
* Bring deployment-bastion back online (button labeled "Bring this node back online")
* Launch slave agent (there's a button that says this)
* Check agent log to see that it connected <https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/log>

Sometimes you have to do this whole dance twice before Jenkins realizes that the there are a bunch of executors that it can use.

This deadlock seems to happen more often than not following or during a database update that is taking a while to complete.
Comment 2 James Forrester 2014-09-09 04:58:27 UTC
Thanks! Should we write these down somewhere for the next time it occurs?
Comment 3 Bryan Davis 2014-09-09 05:00:58 UTC
(In reply to James Forrester from comment #2)
> Thanks! Should we write these down somewhere for the next time it occurs?

Greg was way ahead of you -- https://www.mediawiki.org/w/index.php?title=Continuous_integration/Jenkins&diff=1144517&oldid=1091321
Comment 4 James Forrester 2014-09-09 05:03:44 UTC
Ha. Thanks, both!
Comment 5 Chris McMahon 2014-09-09 16:19:20 UTC
Meta: one of the themes for RelEng coming up is improving Jenkins performance, this list of issues is a good place to start.
Comment 6 Bryan Davis 2014-09-09 23:46:06 UTC
This happened again today. It seems like it always involves the database update job. Maybe related to https://issues.jenkins-ci.org/browse/JENKINS-10944. Apparently the parent build for a matrix job is not supposed to occupy an executor, but sometimes it does. Seems to be an open bug still upstream.

I have manually changed the config for the beta-update-databases-eqiad job in Jenkins to use the "Throttle Concurrent Builds" in an attempt to keep Jenkins from confusing itself. If this "works" then the changes should be backported into the JJB configuration to keep it from coming back the next time the job is updated from config.

* Maximum Total Concurrent Builds: 2
* Maximum Concurrent Builds Per Node: 2
* Throttle Matrix master builds: true
* Throttle Matrix configuration builds: true

(reopening to keep this on my radar for a few days)
Comment 7 Greg Grossmeier 2014-09-23 22:55:16 UTC
Sept 23rd: 22:51 bd808: Jenkins stuck trying to update database in beta again with the dumb "waiting for executors" bug/problem

(just keeping a log of when it occurs)
Comment 8 Bryan Davis 2014-09-23 23:18:13 UTC
(In reply to Bryan Davis from comment #6)
> 
> I have manually changed the config for the beta-update-databases-eqiad job
> in Jenkins to use the "Throttle Concurrent Builds" in an attempt to keep
> Jenkins from confusing itself. If this "works" then the changes should be
> backported into the JJB configuration to keep it from coming back the next
> time the job is updated from config.
> 
> * Maximum Total Concurrent Builds: 2
> * Maximum Concurrent Builds Per Node: 2
> * Throttle Matrix master builds: true
> * Throttle Matrix configuration builds: true
> 
> (reopening to keep this on my radar for a few days)

The throttle settings are still in place, so they didn't fix the problem. The fact that we didn't see this error for a couple of weeks is apparently uncorrelated.
Comment 9 Antoine "hashar" Musso (WMF) 2014-10-24 14:57:56 UTC
That happened again on Oct 23 2014.  Looking at Jenkins thread dumps, the deployment-bastion execute thread are locked by the Gearman plugin.   It must have some logic error somewhere but I can't really debug Java ;(
Comment 10 Bryan Davis 2014-10-31 00:21:45 UTC
Un-licking this cookie. I took a shot at it but the problem seems to be deeper in the stack.
Comment 11 Antoine "hashar" Musso (WMF) 2014-10-31 01:03:25 UTC
I forgot to update this bug after my debugging session on Oct 24th here is the rough brain dump:

The Gearman plugin source code is at https://review.openstack.org/openstack-infra/gearman-plugin.git

I have added a logger for the Gearman plugin ( hudson.plugins.gearman.logger @ INFO ):

https://integration.wikimedia.org/ci/log/Plugins%20-%20Gearman/

Whenever the issue occurs we can switch it to FINE and get some debug messages.


IIRC the bastion executors threads were held in lock() within src/main/java/hudson/plugins/gearman/NodeAvailabilityMonitor.java lock() function. It takes a worker as parameter and has a wait(5000).

I think the debug message had a 'null' worker possibly ending in a dead end of being always considered busy and thus waiting.

That might be a bad interaction with the jobs that are scheduled by Jenkins itself such as the hourly database update that runs on deployment bastion as well.


And of course: I have no idea how to reproduce it nor how to hook a debugger in Jenkins nor how to dump the state of variables :-/
Comment 12 Antoine "hashar" Musso (WMF) 2014-11-04 14:22:45 UTC
Created attachment 17012 [details]
Raw thread dump from Jenkins

deployment-bastion.eqiad node is in a deadlock right now, I took a thread dump from Jenkins using https://integration.wikimedia.org/ci/monitoring?part=threadsDump

The Jenkins Gearman plugin seems to lock the node executors, with threads looking like:

  "Gearman worker deployment-bastion.eqiad_exec-0" daemon prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:388)
	hudson.plugins.gearman.MyGearmanWorkerImpl.handleSessionEvent(MyGearmanWorkerImpl.java:422)
	hudson.plugins.gearman.MyGearmanWorkerImpl$GrabJobEventHandler.handleEvent(MyGearmanWorkerImpl.java:99)
	org.gearman.common.GearmanTask.handleGearmanIOEvent(GearmanTask.java:92)
	org.gearman.common.GearmanJobServerSession.handleResSessionEvent(GearmanJobServerSession.java:329)
	org.gearman.common.GearmanJobServerSession.handleSessionEvent(GearmanJobServerSession.java:240)
	org.gearman.common.GearmanJobServerSession.driveSessionIO(GearmanJobServerSession.java:203)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:357)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:701)
Comment 13 Antoine "hashar" Musso (WMF) 2014-11-04 14:24:31 UTC
From Jenkins log https://integration.wikimedia.org/ci/log/Plugins%20-%20Gearman/  ( hudson.plugins.gearman.logger @ FINEST ) there is a spam of:

AvailabilityMonitor canTake request for deployment-bastion.eqiad_exec-1
Nov 04, 2014 2:15:19 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-bastion.eqiad_exec-1
Nov 04, 2014 2:15:19 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
Comment 14 Antoine "hashar" Musso (WMF) 2014-11-04 14:30:57 UTC
The rest of the executors are hold in lock with:

"Gearman worker deployment-bastion.eqiad_exec-2"
   daemon prio=5 TIMED_WAITING

 java.lang.Object.wait(Native Method)
 hudson.plugins.gearman
   .NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
   .MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:388)
   .MyGearmanWorkerImpl.handleSessionEvent(MyGearmanWorkerImpl.java:422)
Comment 15 Antoine "hashar" Musso (WMF) 2014-11-04 16:22:07 UTC
I disconnected the Jenkins Gearman client  which is holding the lock. That unleashed the jobs that are meant to be run on deployment-bastion.eqiad.
Comment 16 Antoine "hashar" Musso (WMF) 2014-11-05 10:12:59 UTC
I am moving this bug to Continous integration since its root cause is somewhere in Jenkins Gearman plugin. Will make it easier for me to find that bug again.
Comment 17 Antoine "hashar" Musso (WMF) 2014-11-05 10:18:43 UTC
Most executors are held in lock as mentioned on comment #14.  I noticed one thread having a different stacktrace:

"Gearman worker deployment-bastion.eqiad_exec-1" daemon prio=5 WAITING
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:502)
    hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
    hudson.plugins.gearman
        .StartJobWorker.safeExecuteFunction(StartJobWorker.java:196)
    hudson.plugins.gearman
        .StartJobWorker.executeFunction(StartJobWorker.java:114)
    org.gearman.worker
        .AbstractGearmanFunction.call(AbstractGearmanFunction.java:125)
    org.gearman.worker
        .AbstractGearmanFunction.call(AbstractGearmanFunction.java:22)
    hudson.plugins.gearman.MyGearmanWorkerImpl
        .submitFunction(MyGearmanWorkerImpl.java:590)
    hudson.plugins.gearman.MyGearmanWorkerImpl
        .work(MyGearmanWorkerImpl.java:374)
    hudson.plugins.gearman.AbstractWorkerThread
        .run(AbstractWorkerThread.java:166)
    java.lang.Thread.run(Thread.java:701)
Comment 18 Antoine "hashar" Musso (WMF) 2014-11-05 11:06:01 UTC
This need to reported to upstream https://bugs.launchpad.net/gearman-plugin/
Comment 19 Antoine "hashar" Musso (WMF) 2014-11-20 16:36:04 UTC
*** Bug 73659 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links