Last modified: 2014-11-01 15:41:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T65760, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 63760 - Jobs are sometime no more being triggered by Zuul / Jenkins
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: Continuous integration
Version: wmf-deployment
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Antoine "hashar" Musso (WMF)
Duplicates: 69045 70256
Depends on: 72113
Blocks: 69045
 
Reported: 2014-04-10 09:15 UTC by Antoine "hashar" Musso (WMF)
Modified: 2014-11-01 15:41 UTC (History)



Attachments
Zuul events spike (7.48 KB, image/png)
2014-06-07 08:32 UTC, Antoine "hashar" Musso (WMF)

Description Antoine "hashar" Musso (WMF) 2014-04-10 09:15:10 UTC
From time to time, some subsets of jobs are no longer executed. Zuul does enqueue them properly, as can be seen on https://integration.wikimedia.org/zuul/ when the issue occurs.

The Jenkins queue is idle and the target hosts are not running any tests.

An example of a stuck job is:

 $ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
 build:integration-jjb-config-test	2	0	14
 build:integration-jjb-config-test:contintLabsSlave	0	0	14
 $

Where the numbers are Total, Running, Workers.  The status page shows two jobs being stuck.
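
The check above can be automated. The following is a minimal sketch, assuming the `status` output is tab-separated as shown and that the server's reply ends with a line containing a single "."; the names `FunctionStatus`, `parse_status` and `find_stuck` are hypothetical helpers, not part of Zuul or Gearman. A single snapshot can also catch a job that is merely about to start, so a function should only be treated as stuck if it keeps showing up.

```python
from typing import List, NamedTuple


class FunctionStatus(NamedTuple):
    name: str
    total: int    # jobs in the queue, including running ones
    running: int  # jobs currently running
    workers: int  # workers registered for the function


def parse_status(text: str) -> List[FunctionStatus]:
    """Parse the tab-separated output of the Gearman `status` admin command."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":  # "." terminates the server's reply
            continue
        name, total, running, workers = line.split("\t")
        rows.append(FunctionStatus(name, int(total), int(running), int(workers)))
    return rows


def find_stuck(rows: List[FunctionStatus]) -> List[FunctionStatus]:
    """Functions with queued jobs and registered workers, but nothing running."""
    return [r for r in rows if r.total > 0 and r.running == 0 and r.workers > 0]


# Sample taken from the status output quoted in this report.
sample = (
    "build:integration-jjb-config-test\t2\t0\t14\n"
    "build:integration-jjb-config-test:contintLabsSlave\t0\t0\t14\n"
    ".\n"
)

for row in find_stuck(parse_status(sample)):
    print(f"{row.name}: {row.total} queued, {row.running} running, {row.workers} workers")
```

Only the label-less function is reported here, since the labelled one has nothing queued.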


Another occurrence:

 $ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
 build:apps-android-wikipedia-tox-flake8	17	0	14
 build:apps-android-wikipedia-tox-flake8:contintLabsSlave	0	0	14
 $

And there are indeed 17 such jobs stuck.


Suspicion: both jobs are tied to the node label contintLabsSlave. Zuul apparently asked to run the label-less function, which was properly enqueued by the Gearman server. Since the job has a label, the label-less function is never processed by the Jenkins Gearman plugin.
Comment 1 Antoine "hashar" Musso (WMF) 2014-04-10 09:26:34 UTC
Once slaves are disconnected I get:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test:contintLabsSlave	0	0	0
build:integration-jjb-config-test	2	0	0

$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8	22	0	0
build:apps-android-wikipedia-tox-flake8:contintLabsSlave	0	0	0


It did process a few jobs but got stuck again:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test:contintLabsSlave	0	0	14
build:integration-jjb-config-test	2	0	14

$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8	16	0	14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave	0	0	14
Comment 2 Antoine "hashar" Musso (WMF) 2014-04-10 09:34:22 UTC
Disconnecting and reconnecting the gearman client does unleash a few jobs.

Disconnecting and reconnecting a slave does unleash them as well.


Here is the debug output from when I disconnected and reconnected integration-slave1002.eqiad.wmflabs:


hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8	12	2	14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave	0	0	14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8	11	1	14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave	0	0	14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8	10	0	14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave	0	0	14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8	10	0	14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave	0	0	14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8	9	2	14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave	0	0	14
hashar@gallium:~$ 


It eventually managed to run them all.
Comment 3 Antoine "hashar" Musso (WMF) 2014-04-16 18:20:56 UTC
I have upgraded Zuul (wmf-deploy-20140122..wmf-deploy-20140416-3). That might fix it.
Comment 4 Antoine "hashar" Musso (WMF) 2014-04-29 10:15:57 UTC
We got python-gear upgraded from 0.4.0 to 0.5.4, which fixes a bunch of function registration errors in Gearman. That might solve the issue.
Comment 5 Antoine "hashar" Musso (WMF) 2014-05-19 11:00:49 UTC
It seems this is no longer occurring.
Comment 6 Antoine "hashar" Musso (WMF) 2014-05-23 15:44:20 UTC
That occurred again today around noon UTC. Jenkins/Zuul restarted at around 14:17 UTC  :-(
Comment 7 Antoine "hashar" Musso (WMF) 2014-05-28 16:48:39 UTC
Crashed again on May 28th during the European afternoon.

Jobs meant to run on labs instances ended up no longer being registered with the Zuul Gearman server. That must be a bug in the Jenkins Gearman plugin :-( (bug 63760)
Comment 8 Antoine "hashar" Musso (WMF) 2014-05-30 09:24:20 UTC
Another occurrence:


hashar@gallium:~$ echo status|nc -q 2 localhost 4730|fgrep apps-android-wikipedia-maven-checkstyle
build:apps-android-wikipedia-maven-checkstyle:contintLabsSlave	0	0	10
build:apps-android-wikipedia-maven-checkstyle	10	0	10

The numbers are Total, Running, Workers.


And there are indeed workers registered for the function:


hashar@gallium:~$ echo workers|nc -q 2 localhost 4730|fgrep apps-android-wikipedia-maven-checkstyle|cut -b1-50
54 127.0.0.1 integration-slave1002_exec-3 : build:
53 127.0.0.1 integration-slave1002_exec-1 : build:
55 127.0.0.1 integration-slave1002_exec-4 : build:
56 127.0.0.1 integration-slave1002_exec-0 : build:
57 127.0.0.1 integration-slave1002_exec-2 : build:
14 127.0.0.1 integration-slave1001_exec-0 : build:
19 127.0.0.1 integration-slave1001_exec-3 : build:
21 127.0.0.1 integration-slave1001_exec-4 : build:
22 127.0.0.1 integration-slave1001_exec-2 : build:
28 127.0.0.1 integration-slave1001_exec-1 : build:

The functions registered:

 build:apps-android-wikipedia-maven-checkstyle 
 build:apps-android-wikipedia-maven-checkstyle:contintLabsSlave
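
To cross-check which workers claim a function, the `workers` output can be grouped by function name. This is a rough sketch, assuming each line has the form "FD IP CLIENT-ID : FUNCTION ..." as in the truncated output above; `workers_by_function` is a hypothetical helper, and the sample lines are reconstructed from the two registered functions listed above rather than copied from a real capture.

```python
from collections import defaultdict


def workers_by_function(text):
    """Map each function name to the client IDs of the workers registered for it."""
    registered = defaultdict(list)
    for line in text.splitlines():
        head, sep, tail = line.partition(" :")
        if not sep:  # blank line or the "." terminator
            continue
        fields = head.split()
        if len(fields) != 3:  # expected: file descriptor, IP address, client ID
            continue
        client_id = fields[2]
        for function in tail.split():
            registered[function].append(client_id)
    return dict(registered)


# Illustrative input: untruncated lines are assumed to list both functions.
job = "build:apps-android-wikipedia-maven-checkstyle"
sample = (
    f"54 127.0.0.1 integration-slave1002_exec-3 : {job} {job}:contintLabsSlave\n"
    f"14 127.0.0.1 integration-slave1001_exec-0 : {job} {job}:contintLabsSlave\n"
    ".\n"
)

for function, clients in workers_by_function(sample).items():
    print(function, len(clients), "workers:", ", ".join(clients))
```

A function that has queued jobs in `status` but no entry in this mapping would point at a registration problem instead of a stuck dispatch.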



WORKAROUND: disconnect and reconnect the labs slaves.
Comment 9 Antoine "hashar" Musso (WMF) 2014-06-07 08:32:18 UTC
Created attachment 15589
Zuul events spike

Earlier this week I noticed Zuul being trapped in some loop. Upstream has noticed it as well from time to time but never managed to track it down. Attached is a graph showing the spike of events on June 6th, which is caused by the death loop.
Comment 10 Antoine "hashar" Musso (WMF) 2014-08-02 14:10:09 UTC
*** Bug 69045 has been marked as a duplicate of this bug. ***
Comment 11 Antoine "hashar" Musso (WMF) 2014-09-16 07:51:03 UTC
*** Bug 70256 has been marked as a duplicate of this bug. ***
Comment 12 Antoine "hashar" Musso (WMF) 2014-09-29 15:59:56 UTC
Documented a workaround on https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues


The Gearman server sometimes deadlocks when a job is created in Jenkins. The Gearman process is still around but TCP connections time out completely and it does not process anything. The workaround is to disconnect Jenkins from the Gearman server and reconnect it:

head to https://integration.wikimedia.org/ci/configure logged in with a WMF ldap account
search for "Gearman"
uncheck "Enable Gearman"
Save at the bottom
search for "Gearman"
check "Enable Gearman"
Save at the bottom
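
Before applying the workaround, it can help to confirm that the server really has stopped answering. The sketch below is one possible probe, assuming the admin port is 4730 on localhost as in the commands quoted in this report and that a `status` reply ends with a line containing a single "."; `gearman_responds` is a hypothetical helper and the 5 second timeout is an arbitrary choice.

```python
import socket


def gearman_responds(host="localhost", port=4730, timeout=5.0):
    """Return True if the server answers the `status` admin command in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"status\n")
            data = b""
            # The reply ends with a line containing a single ".".
            while not (data == b".\n" or data.endswith(b"\n.\n")):
                chunk = sock.recv(4096)
                if not chunk:  # connection closed before the terminator
                    return False
                data += chunk
            return True
    except OSError:  # covers timeouts and refused connections
        return False


if __name__ == "__main__":
    if gearman_responds():
        print("Gearman server answered the status command")
    else:
        print("Gearman server did not answer; it may be deadlocked")
```

A False result only says that no complete reply arrived in time; it does not distinguish a deadlock from a server that is simply not running.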
Comment 13 Antoine "hashar" Musso (WMF) 2014-10-21 16:15:45 UTC
That is related to bug 63758 (JJB created jobs not registering).

I have upgraded the Jenkins Gearman plugin to fix job registration:
* cherry picked https://review.openstack.org/#/c/125755/ patchset 8
* compiled it via maven
* uploaded and restarted Jenkins


That bumps the Gearman plugin to a version with support for the Jenkins LTS release we are using, which is probably going to help.



I found another issue that causes the Gearman server to lock up completely while waiting for data to be received on a socket. Filed upstream as https://bugs.launchpad.net/gear/+bug/1381565
Comment 14 Antoine "hashar" Musso (WMF) 2014-11-01 15:41:16 UTC
The root cause is that the Gearman server stops responding for an unknown reason.

When reconnecting it (see comment #12), the jobs were still stuck in the queue due to a bug in Zuul. That is bug 72113; the patch I wrote is applied on our Zuul and confirmed to work (merge functions are now properly retriggered when Gearman comes back).
