Last modified: 2013-12-18 20:33:52 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T50025, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 48025 - Zuul is ultra slow post Jenkins upgrade
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Continuous integration (Other open bugs)
Version: wmf-deployment
Hardware: All
OS: All
Priority: High
Severity: major (vote)
Target Milestone: ---
Assigned To: Antoine "hashar" Musso (WMF)
Depends on:
Blocks:
Reported: 2013-05-02 22:19 UTC by Antoine "hashar" Musso (WMF)
Modified: 2013-12-18 20:33 UTC (History)
5 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Apache log showing Zuul updating build descriptions (15.72 KB, text/plain)
2013-05-09 20:33 UTC, Antoine "hashar" Musso (WMF)
Details
jstack -F -m -l using the PID of the stalled thread (60.90 KB, text/plain)
2013-05-10 08:30 UTC, Antoine "hashar" Musso (WMF)
Details
jmap using the PID of the stalled thread (1.30 KB, text/plain)
2013-05-10 08:30 UTC, Antoine "hashar" Musso (WMF)
Details
Jenkins threaddump 1 (57.66 KB, text/plain)
2013-05-10 09:00 UTC, Antoine "hashar" Musso (WMF)
Details
Jenkins threaddump 2 (63.69 KB, text/plain)
2013-05-10 09:02 UTC, Antoine "hashar" Musso (WMF)
Details

Description Antoine "hashar" Musso (WMF) 2013-05-02 22:19:39 UTC
Jenkins was slow to restart (bug 47120) and got upgraded.  Since then, Zuul has been very slow to trigger builds.

After investigating with upstream (James E. Blair from OpenStack), it turns out Zuul waits for the results of JSON API calls that verify whether a job exists before triggering it.

We have found a few API calls that took 2 to 4 minutes, which would lock Zuul's internal scheduler. In the end, Zuul would spend most of its time waiting.

This is a major issue.


A workaround is to skip the API call. James advised faking the ExtendedJenkins class in Zuul so it pretends jobs always exist.

https://gerrit.wikimedia.org/r/#/c/62095/
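A minimal sketch of that workaround (the real change is the Gerrit patch above; the class name comes from the comment, but the method name and shape here are assumptions, not Zuul's actual code):

```python
# Hypothetical sketch: override the job-existence check so Zuul never
# waits on the slow Jenkins JSON API before triggering a build.

class ExtendedJenkins:
    """Stand-in for Zuul's Jenkins wrapper (name taken from the comment)."""
    def job_exists(self, name):
        # Original behaviour: hit the Jenkins JSON API, observed to
        # take 2 to 4 minutes per call on this setup.
        raise NotImplementedError

def job_always_exists(self, name):
    # Workaround: pretend every job exists and skip the API round trip.
    return True

# Monkey-patch the check in place.
ExtendedJenkins.job_exists = job_always_exists

print(ExtendedJenkins().job_exists("mediawiki-core-lint"))  # True
```

The trade-off is that a typo in a job name is no longer caught before triggering; the build simply fails later in Jenkins.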
Comment 1 Antoine "hashar" Musso (WMF) 2013-05-02 22:20:01 UTC
Yay for working overnight :-]
Comment 2 Antoine "hashar" Musso (WMF) 2013-05-02 23:01:55 UTC
The workaround did get rid of the calls to the slow API.

Looking at the Apache logs in /var/log/apache2/integration_access.log, I have noticed it took up to 8 seconds to update a job description. That is something Zuul does very often, and it prevents Zuul from doing anything else.

Jenkins had at least one thread at 100% CPU, and the main (java) process was reporting 100% CPU too.  I have restarted Jenkins.
Comment 3 Antoine "hashar" Musso (WMF) 2013-05-02 23:21:52 UTC
Tips to investigate:

tail -f /var/log/apache2/integration_access.log | grep Python

That will show the Zuul requests to Jenkins.  The timestamp is when the query started, so a timestamp earlier than the preceding line indicates a query that took a long time.  On completion, the query sends a 302, which in turn triggers a GET from Jenkins.

Example:

 02/May/2013:23:17:53
 "POST /ci/job/parsoid-parsertests-run-harder/65/submitDescription 302"
 02/May/2013:23:17:53
 "GET /ci/job/parsoid-parsertests-run-harder/65/ 200"

That one has been fast.


Before triggering a job, Zuul attempts to look up whether the job exists. It does that via the python-jenkins module, which hits an API call that is very slow on our setup.    I have overridden the existence check with Gerrit change #62095.


For now it seems Jenkins is more or less working after the last restart.
Comment 4 Antoine "hashar" Musso (WMF) 2013-05-06 19:25:07 UTC
This got fixed for now after a looong night of tweaking.

The API calls were fetching the full console log, which is definitely slow. That one got fixed by Ori and merged upstream. https://review.openstack.org/#/c/28128/

The Jenkins API call was hitting a URL that triggered a full parse of the build history for that job. The workaround, as a live hack, was to let Zuul pretend the job always exists. https://gerrit.wikimedia.org/r/#/c/62095/

Two threads in Jenkins were waiting for each other (as I understand it) causing overall slowness. That got resolved after yet another restart.


For now it seems to be stable. We will see how bad it is after a restart and a slow start, maybe something to try out in the morning.

Lowering priority for now.
Comment 5 Antoine "hashar" Musso (WMF) 2013-05-09 20:28:56 UTC
There is another issue that has triggered a few times this week: updating the job description suddenly becomes slow.

Ganglia shows a CPU at 100%; top -i followed by pressing H shows a Jenkins thread at 100%.

Looking at the apache proxy that handles Jenkins access, one can monitor what Zuul is doing with Jenkins:


 tail -f /var/log/apache2/integration_access.log|grep Python-urllib

It shows Zuul updating the build descriptions with requests such as:

  POST /ci/job/mediawiki-core-qunit/3359/submitDescription  302
  GET /ci/job/mediawiki-core-qunit/3359/  200

They take from 3 to 12 seconds, and sometimes even more. That is the root cause of the slowness.


Whenever a build starts or emits a result, Zuul updates the descriptions of all the builds triggered by that change.  So given a pipeline of 10 jobs, that is potentially a lot of description updates.  Zuul is also locked waiting for Jenkins to answer, which in turn makes it very slow to process events.
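A back-of-the-envelope estimate of the request volume (the per-event behaviour is as described above; the exact multipliers here are assumptions for illustration):

```python
# Assumptions: 10 jobs in the pipeline for one change; each build
# emits a start event and a result event; every event triggers one
# description update for each build in the change.
jobs = 10
events = 2 * jobs            # a start and a result per build
updates = events * jobs      # one update per sibling build per event
print(updates)               # 200 description updates for one change

# At the 3-12 s per update observed above, even the lower bound is
# painful, and Zuul is blocked for the whole duration:
print(updates * 3 / 60)      # 10.0 minutes of Zuul doing nothing else
```

This is why the description updates, not the builds themselves, dominate Zuul's event-processing time.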
Comment 6 Antoine "hashar" Musso (WMF) 2013-05-09 20:33:58 UTC
Created attachment 12281 [details]
Apache log showing Zuul updating build descriptions

The attached log file shows Zuul updating the job descriptions in Jenkins. That is quite a lot of HTTP requests. In this case they are fast, though.
Comment 7 Gerrit Notification Bot 2013-05-09 20:52:12 UTC
Related URL: https://gerrit.wikimedia.org/r/63064 (Gerrit Change Ie33e1adb8d9facab8c57376cdd418baf0d8a208a)
Comment 8 Antoine "hashar" Musso (WMF) 2013-05-09 21:17:46 UTC
https://gerrit.wikimedia.org/r/63068 enable Jenkins access log to track the requests made by Zuul

https://gerrit.wikimedia.org/r/63069 makes Zuul access Jenkins via http://127.0.0.1 instead of the public interface. That saves the SSL negotiation and the Apache overhead.
Comment 9 Antoine "hashar" Musso (WMF) 2013-05-09 21:48:30 UTC
We now have Jenkins access.log and Zuul querying on localhost directly.
Comment 10 Antoine "hashar" Musso (WMF) 2013-05-09 22:00:52 UTC
I have pinged wikitech-l about the two crashes that happened today related to that issue: http://lists.wikimedia.org/pipermail/wikitech-l/2013-May/069261.html
Comment 11 Antoine "hashar" Musso (WMF) 2013-05-10 08:26:36 UTC
This happened again overnight. Will attach some debugging output.
Comment 12 Antoine "hashar" Musso (WMF) 2013-05-10 08:30:05 UTC
Created attachment 12286 [details]
jstack -F -m -l  using the PID of the stalled thread
Comment 13 Antoine "hashar" Musso (WMF) 2013-05-10 08:30:57 UTC
Created attachment 12287 [details]
jmap using the PID of the stalled thread
Comment 14 Antoine "hashar" Musso (WMF) 2013-05-10 09:00:53 UTC
Created attachment 12288 [details]
Jenkins threaddump 1
Comment 15 Antoine "hashar" Musso (WMF) 2013-05-10 09:02:31 UTC
Created attachment 12289 [details]
Jenkins threaddump 2

Jenkins has an internal thread dumper which is available via the web interface https://integration.wikimedia.org/ci/threadDump and dumped in the /var/log/jenkins/jenkins.log file.  The stalled thread is still 24524.

The two thread dumps attached have been taken at 10 minutes interval.
Comment 16 Antoine "hashar" Musso (WMF) 2013-05-10 09:06:42 UTC
The stalled thread has the id 24524 (or 0x5fcc). It shows in both thread dumps as:

"VM Thread" prio=10 tid=0x00007ff398079000 nid=0x5fcc runnable
Comment 17 Antoine "hashar" Musso (WMF) 2013-05-10 10:23:45 UTC
Last week Timo had the Java Melody plugin installed on Jenkins, which lets us monitor the java process. The entry page is https://integration.wikimedia.org/ci/monitoring

We can get heap history via https://integration.wikimedia.org/ci/monitoring?part=heaphisto
Comment 18 Antoine "hashar" Musso (WMF) 2013-05-10 10:55:02 UTC
After a bit more investigation, our build.xml files contain references to org.jvnet.hudson.plugins.DownstreamBuildViewAction, a plugin that displayed the dependencies between builds.  When upgrading Jenkins I removed that plugin, but the build files still contain references to it, such as:


    <org.jvnet.hudson.plugins.DownstreamBuildViewAction>
      <build class="build" reference="../../.."/>
    </org.jvnet.hudson.plugins.DownstreamBuildViewAction>

When Jenkins parses the build history, it detects that this is no longer used and logs a message, which is kept in memory.  The data can be discarded manually via 'Manage old data': https://integration.wikimedia.org/ci/administrativeMonitor/OldData/manage

With thousands of builds in the history, the old data store turns out to have too many items for the Java heap size, which most probably caused the memory exhaustion.


The slow way to solve this is to ask Jenkins to parse the history files for a job and then manually discard all the data.   The fast / hard way is to parse all the build.xml files and remove the XML snippet.
Comment 19 Antoine "hashar" Musso (WMF) 2013-05-10 11:33:11 UTC
And the super long filtering command:


find . -name build.xml \
 -exec fgrep -q 'DownstreamBuild' {} \; \
 -exec sed -i-back \
 '/    <org.jvnet.hudson.plugins/,/\/org.jvnet.hudson.plugins.DownstreamBuildViewAction>/d' {} \;
Comment 20 Antoine "hashar" Musso (WMF) 2013-05-10 12:05:20 UTC
Running the above command on mediawiki-core-lint.
Comment 21 Antoine "hashar" Musso (WMF) 2013-05-10 12:23:56 UTC
Launched for all jobs using  find . -wholename '*/builds/*/builds.xml'

That is running in a tmux window on gallium.
Comment 22 Antoine "hashar" Musso (WMF) 2013-05-10 17:09:55 UTC
All build.xml now have the org.jvnet.hudson.plugins.DownstreamBuildViewAction snippet removed.
Comment 23 Antoine "hashar" Musso (WMF) 2013-05-10 19:14:54 UTC
Sent a summary on wikitech-l http://lists.wikimedia.org/pipermail/wikitech-l/2013-May/069303.html

I will monitor this bug through Monday and then probably close it.
Comment 24 Antoine "hashar" Musso (WMF) 2013-05-13 21:30:57 UTC
That issue is solved for now. The build.xml rewrite fixed it.
Comment 25 Gerrit Notification Bot 2013-10-01 05:39:15 UTC
Change 86812 had a related patch set uploaded by Matanya:
Removed pstack package since bug 48025 is resolved.

https://gerrit.wikimedia.org/r/86812
Comment 26 Antoine "hashar" Musso (WMF) 2013-10-01 08:13:32 UTC
Unrelated change.
Comment 27 Gerrit Notification Bot 2013-10-01 08:18:11 UTC
Change 86812 merged by Dzahn:
Removed pstack package since bug 48025 is resolved.

https://gerrit.wikimedia.org/r/86812
Comment 28 Gerrit Notification Bot 2013-12-18 17:17:47 UTC
Change 102465 had a related patch set uploaded by Hashar:
WMF: less build description updates

https://gerrit.wikimedia.org/r/102465
Comment 29 Gerrit Notification Bot 2013-12-18 17:18:48 UTC
Change 102465 merged by Hashar:
WMF: less build description updates

https://gerrit.wikimedia.org/r/102465
Comment 30 Antoine "hashar" Musso (WMF) 2013-12-18 17:49:33 UTC
Gerrit change #102465 deployed in production and apparently solved the issue for good.

Summary sent on wikitech-l http://lists.wikimedia.org/pipermail/wikitech-l/2013-December/073703.html
