Last modified: 2014-11-01 15:41:29 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T74113, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 72113 - [upstream] Zuul prepareRef does not handle failure to connect to Gearman
[upstream] Zuul prepareRef does not handle failure to connect to Gearman
Status: ASSIGNED
Product: Wikimedia
Classification: Unclassified
Continuous integration (Other open bugs)
wmf-deployment
All All
: Normal normal (vote)
: ---
Assigned To: Antoine "hashar" Musso (WMF)
: upstream
Depends on:
Blocks: 63760
  Show dependency treegraph
 
Reported: 2014-10-16 00:47 UTC by Krinkle
Modified: 2014-11-01 15:41 UTC (History)
3 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Screenshot of Zuul status (438.01 KB, image/png)
2014-10-16 00:47 UTC, Krinkle
Details

Description Krinkle 2014-10-16 00:47:52 UTC
Created attachment 16783 [details]
Screenshot of Zuul status

A few hours ago the Zuul queue started to get stuck. A stack of three changes in mediawiki-extensions-DonationInterface was reviewed with CodeReview+3. The top one (commit C, depending on B, depending on A) had a merge conflict in it.

C https://gerrit.wikimedia.org/r/#/c/166678/3
B https://gerrit.wikimedia.org/r/#/c/166803/1
A https://gerrit.wikimedia.org/r/#/c/166889/2

Somehow this caused Zuul's gate-and-submit pipeline to get stuck indefinitely. Even changes from mwext-VisualEditor, mediawiki-core, mwext-MobileFrontend were all waiting.

Debug information from Zuul's log

-------

$ cat /var/log/zuul/zuul.log.2014-10-15 | grep '166678,3'
2014-10-15 19:17:14,453 INFO zuul.Gerrit: Updating information for 166678,3
2014-10-15 19:17:15,169 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b0ff0590 166678,3> to <Pipeline check-only>
2014-10-15 19:17:15,169 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b0ff0590 166678,3> to <Pipeline check-voter>
2014-10-15 19:17:15,170 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b0ff0590 166678,3> to <Pipeline test>
2014-10-15 19:17:15,171 INFO zuul.IndependentPipelineManager: Change <Change 0x7f17b0ff0590 166678,3> depends on changes []
2014-10-15 19:17:17,470 INFO zuul.IndependentPipelineManager: Change <Change 0x7f17b0ff0590 166678,3> depends on changes []
2014-10-15 19:17:17,470 INFO zuul.Gearman: Launch job mwext-DonationInterface-lint for change <Change 0x7f17b0ff0590 166678,3> with dependent changes []
2014-10-15 19:17:17,474 INFO zuul.Gearman: Launch job mwext-DonationInterface-npm for change <Change 0x7f17b0ff0590 166678,3> with dependent changes []
2014-10-15 19:17:19,481 INFO zuul.IndependentPipelineManager: Change <Change 0x7f17b0ff0590 166678,3> depends on changes []
2014-10-15 19:17:19,481 INFO zuul.Gearman: Launch job mwext-DonationInterface-testextension for change <Change 0x7f17b0ff0590 166678,3> with dependent changes []
2014-10-15 19:18:08,759 INFO zuul.IndependentPipelineManager: Reporting change <Change 0x7f17b0ff0590 166678,3>, actions: [<ActionReporter <zuul.reporter.gerrit.Reporter object at 0x1b6c310>, {'verified': 2}>]
2014-10-15 19:18:08,842 INFO zuul.Gerrit: Updating information for 166678,3
2014-10-15 23:21:16,969 INFO zuul.Gerrit: Updating information for 166678,3
2014-10-15 23:21:17,284 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b0ff0590 166678,3> to <Pipeline gate-and-submit>
2014-10-15 23:21:17,284 INFO zuul.DependentPipelineManager: Reporting start, action [<ActionReporter <zuul.reporter.gerrit.Reporter object at 0x1b6c310>, {'verified': 0}>] change <Change 0x7f17b0ff0590 166678,3>
2014-10-15 23:21:17,359 INFO zuul.Gerrit: Updating information for 166678,3
2014-10-15 23:21:17,362 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0ff0590 166678,3> depends on changes [<Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:47:30,406 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0f64c10 166803,1> depends on changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:47:31,790 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0f64c10 166803,1> depends on changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:47:31,790 INFO zuul.Gearman: Launch job mediawiki-gate for change <Change 0x7f17b0f64c10 166803,1> with dependent changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:47:31,794 INFO zuul.Gearman: Launch job mwext-DonationInterface-lint for change <Change 0x7f17b0f64c10 166803,1> with dependent changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:47:31,798 INFO zuul.Gearman: Launch job mwext-DonationInterface-npm for change <Change 0x7f17b0f64c10 166803,1> with dependent changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:48:30,564 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b13caed0 166889,2> depends on changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:50:13,976 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0f64c10 166803,1> depends on changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:50:13,976 INFO zuul.Gearman: Launch job mwext-DonationInterface-testextension for change <Change 0x7f17b0f64c10 166803,1> with dependent changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:59:51,805 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0c3d250 166918,2> depends on changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,864 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0c3d250 166918,2> depends on changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,864 INFO zuul.Gearman: Launch job mwext-VisualEditor-qunit for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,868 INFO zuul.Gearman: Launch job mwext-VisualEditor-bundle-rubocop for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,876 INFO zuul.Gearman: Launch job mwext-VisualEditor-phpcs-HEAD for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,884 INFO zuul.Gearman: Launch job mediawiki-gate for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,888 INFO zuul.Gearman: Launch job mwext-VisualEditor-npm for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,893 INFO zuul.Gearman: Launch job mwext-VisualEditor-lint for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,957 INFO zuul.Gearman: Launch job mwext-VisualEditor-doc-test for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:56,023 INFO zuul.Gearman: Launch job experiment-gating-dependencies for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
[00:45 UTC] zuul at gallium.wikimedia.org in /srv/org/wikimedia/integration ((no branch)%)


-------

$ cat /var/log/zuul/zuul.log.2014-10-15 | grep '166803,1'
2014-10-15 19:38:32,047 INFO zuul.Gerrit: Updating information for 166803,1
2014-10-15 19:38:32,382 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b0f64c10 166803,1> to <Pipeline check-only>
2014-10-15 19:38:32,405 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b0f64c10 166803,1> to <Pipeline check-voter>
2014-10-15 19:38:32,406 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b0f64c10 166803,1> to <Pipeline test>
2014-10-15 19:38:32,407 INFO zuul.IndependentPipelineManager: Change <Change 0x7f17b0f64c10 166803,1> depends on changes []
2014-10-15 19:38:34,115 INFO zuul.IndependentPipelineManager: Change <Change 0x7f17b0f64c10 166803,1> depends on changes []
2014-10-15 19:38:34,115 INFO zuul.Gearman: Launch job mwext-DonationInterface-lint for change <Change 0x7f17b0f64c10 166803,1> with dependent changes []
2014-10-15 19:38:34,119 INFO zuul.Gearman: Launch job mwext-DonationInterface-npm for change <Change 0x7f17b0f64c10 166803,1> with dependent changes []
2014-10-15 19:38:35,718 INFO zuul.IndependentPipelineManager: Change <Change 0x7f17b0f64c10 166803,1> depends on changes []
2014-10-15 19:38:35,718 INFO zuul.Gearman: Launch job mwext-DonationInterface-testextension for change <Change 0x7f17b0f64c10 166803,1> with dependent changes []
2014-10-15 19:38:48,199 INFO zuul.IndependentPipelineManager: Reporting change <Change 0x7f17b0f64c10 166803,1>, actions: [<ActionReporter <zuul.reporter.gerrit.Reporter object at 0x1b6c310>, {'verified': 2}>]
2014-10-15 19:38:48,273 INFO zuul.Gerrit: Updating information for 166803,1
2014-10-15 19:40:23,214 INFO zuul.Gerrit: Updating information for 166803,1
2014-10-15 19:40:23,668 INFO zuul.Gerrit: Updating information for 166803,1
2014-10-15 23:47:29,998 INFO zuul.Gerrit: Updating information for 166803,1
2014-10-15 23:47:30,326 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b0f64c10 166803,1> to <Pipeline gate-and-submit>
2014-10-15 23:47:30,327 INFO zuul.DependentPipelineManager: Reporting start, action [<ActionReporter <zuul.reporter.gerrit.Reporter object at 0x1b6c310>, {'verified': 0}>] change <Change 0x7f17b0f64c10 166803,1>
2014-10-15 23:47:30,402 INFO zuul.Gerrit: Updating information for 166803,1
2014-10-15 23:47:30,406 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0f64c10 166803,1> depends on changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:47:31,790 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0f64c10 166803,1> depends on changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:47:31,790 INFO zuul.Gearman: Launch job mediawiki-gate for change <Change 0x7f17b0f64c10 166803,1> with dependent changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:47:31,794 INFO zuul.Gearman: Launch job mwext-DonationInterface-lint for change <Change 0x7f17b0f64c10 166803,1> with dependent changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:47:31,798 INFO zuul.Gearman: Launch job mwext-DonationInterface-npm for change <Change 0x7f17b0f64c10 166803,1> with dependent changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:48:30,564 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b13caed0 166889,2> depends on changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:50:13,976 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0f64c10 166803,1> depends on changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:50:13,976 INFO zuul.Gearman: Launch job mwext-DonationInterface-testextension for change <Change 0x7f17b0f64c10 166803,1> with dependent changes [<Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:59:51,805 INFO zuul.DependentPipelineManager: Resetting builds for change <Change 0x7f17b0c3d250 166918,2> because the item ahead, <QueueItem 0x7f180b290c10 for <Change 0x7f17b13caed0 166889,2> in gate-and-submit>, is not the nearest non-failing item, <QueueItem 0x7f17b10042d0 for <Change 0x7f17b0f64c10 166803,1> in gate-and-submit>
2014-10-15 23:59:51,805 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0c3d250 166918,2> depends on changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,864 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b0c3d250 166918,2> depends on changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,864 INFO zuul.Gearman: Launch job mwext-VisualEditor-qunit for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,868 INFO zuul.Gearman: Launch job mwext-VisualEditor-bundle-rubocop for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,876 INFO zuul.Gearman: Launch job mwext-VisualEditor-phpcs-HEAD for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,884 INFO zuul.Gearman: Launch job mediawiki-gate for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,888 INFO zuul.Gearman: Launch job mwext-VisualEditor-npm for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,893 INFO zuul.Gearman: Launch job mwext-VisualEditor-lint for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:55,957 INFO zuul.Gearman: Launch job mwext-VisualEditor-doc-test for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]
2014-10-15 23:59:56,023 INFO zuul.Gearman: Launch job experiment-gating-dependencies for change <Change 0x7f17b0c3d250 166918,2> with dependent changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>]

-------

$ cat /var/log/zuul/zuul.log.2014-10-15 | grep '166889,2'
2014-10-15 22:33:35,953 INFO zuul.Gerrit: Updating information for 166889,2
2014-10-15 22:33:36,261 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b13caed0 166889,2> to <Pipeline check-only>
2014-10-15 22:33:36,261 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b13caed0 166889,2> to <Pipeline check-voter>
2014-10-15 22:33:36,262 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b13caed0 166889,2> to <Pipeline test>
2014-10-15 22:33:36,263 INFO zuul.IndependentPipelineManager: Change <Change 0x7f17b13caed0 166889,2> depends on changes []
2014-10-15 22:33:39,803 INFO zuul.IndependentPipelineManager: Unable to merge change <Change 0x7f17b13caed0 166889,2>
2014-10-15 22:33:39,803 INFO zuul.IndependentPipelineManager: Reporting change <Change 0x7f17b13caed0 166889,2>, actions: [<ActionReporter <zuul.reporter.gerrit.Reporter object at 0x1b6c310>, {'verified': -1}>]
2014-10-15 22:33:39,876 INFO zuul.Gerrit: Updating information for 166889,2
2014-10-15 23:48:30,171 INFO zuul.Gerrit: Updating information for 166889,2
2014-10-15 23:48:30,484 INFO zuul.Scheduler: Adding mediawiki/extensions/DonationInterface, <Change 0x7f17b13caed0 166889,2> to <Pipeline gate-and-submit>
2014-10-15 23:48:30,484 INFO zuul.DependentPipelineManager: Reporting start, action [<ActionReporter <zuul.reporter.gerrit.Reporter object at 0x1b6c310>, {'verified': 0}>] change <Change 0x7f17b13caed0 166889,2>
2014-10-15 23:48:30,559 INFO zuul.Gerrit: Updating information for 166889,2
2014-10-15 23:48:30,564 INFO zuul.DependentPipelineManager: Change <Change 0x7f17b13caed0 166889,2> depends on changes [<Change 0x7f17b0f64c10 166803,1>, <Change 0x7f17b0ff0590 166678,3>, <Change 0x7f17b0c2e750 166607,6>]
2014-10-15 23:48:31,761 INFO zuul.DependentPipelineManager: Unable to merge change <Change 0x7f17b13caed0 166889,2>
2014-10-15 23:59:51,805 INFO zuul.DependentPipelineManager: Resetting builds for change <Change 0x7f17b0c3d250 166918,2> because the item ahead, <QueueItem 0x7f180b290c10 for <Change 0x7f17b13caed0 166889,2> in gate-and-submit>, is not the nearest non-failing item, <QueueItem 0x7f17b10042d0 for <Change 0x7f17b0f64c10 166803,1> in gate-and-submit>
Comment 1 Antoine "hashar" Musso (WMF) 2014-10-16 13:08:49 UTC
Looking at debug.log.2014-10-15


23:21:17
- receives a code review +2 event for change 166678,3
- triggers a merge:merge function

23:21:47 (30 seconds later):

 ERROR gear.Client.unknown:
   Connection <gear.Connection 0x1ca7f90 host: 208.80.154.135 port: 4730>
   timed out waiting for a response to a submit job request:
      <gear.Job 0x7f1809f1dd50
          handle: None
          name: merger:merge
          unique: d928090c54474fa59b51d1661263750b>
 NoConnectedServersError: No connected Gearman servers

Zuul forks to spawn a Gearman server and both process communicate over TCP 4730 via the host public IP.

The Zuul code that handles the creation of the reference (zuul.scheduler.BasePipelineManager.prepareRef() ) does not handle exceptions when submitting the job to Gearman, so it is raised up to the main loop and leave the enqueued change in an inconsistent state.


I am not familiar with that part of the code.  I guess exceptions should be caught and some internal status (there is a `ready` variable) be set accordingly.


To unblock such issue either:

A) potentially a new patchset could be sent (that would remove the previous one from the Zuul queues) then +2ed again to reenter the gate-and-submit pipeline

B) use the Zuul client to move the change ahead in the queue (an action known as 'promote') which would reorder the changes in the pipeline and retrigger merge jobs. Ie:

 zuul promote --pipeline gate-and-submit --changes 166678,3

See doc at http://ci.openstack.org/zuul/client.html#usage


Adjusting priority, we can't have everything highest/critical.
Comment 2 Antoine "hashar" Musso (WMF) 2014-10-16 13:27:58 UTC
Turns out I already filled a bug upstream in August.  The case was the connection to Gearman timing out while preparing a reference (same code path): https://bugs.launchpad.net/zuul/+bug/1358517

I have replied there with a potential solution to the issue.


Marking bug as an upstream issue.
Comment 3 Antoine "hashar" Musso (WMF) 2014-10-20 20:41:13 UTC
I have proposed a change at https://review.openstack.org/#/c/128921/ with a test which most probably has a race condition :D
Comment 4 Antoine "hashar" Musso (WMF) 2014-10-30 15:55:08 UTC
I am now deploying https://review.openstack.org/#/c/128921/3 by merging in our Zuul fork at integration/zuul.git
Comment 5 Antoine "hashar" Musso (WMF) 2014-10-30 21:03:26 UTC
The patch has been deployed. Lets monitor the issue for a few days.
Comment 6 Antoine "hashar" Musso (WMF) 2014-11-01 15:41:29 UTC
The patch solved the issues of patches being stuck in the event queue whenever the connection is lost with the Gearman server. That happened yesterday on the Wikimedia setup, on reestablishing the connection, all the pending merger:merge function have been executed and the event queue processed everything :-)

Keeping bug open until the patch is approved by upstream.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links