Last modified: 2013-06-10 15:03:00 UTC
Since at 2013-04-05 04:29:46 UTC, Zuul no more receives events from Gerrit. It uses a CLI command such as: ssh -p 29418 jenkins-bot@manganese.wikimedia.org gerrit stream-events I did stream-events using both the jenkins-bot user and my own account 'hashar'. While submitting a new patchset or posting a comment: nothing is reported. The end effect is that no jobs are triggered in Jenkins.
This is a major regression in our development workflow.
Gerrit has been upgraded with https://gerrit.wikimedia.org/r/#/c/57508/ Which got merged on Apr 4, 2013 20:29 (UTC). Apparently Gerrit got restarted at 20:44 (UTC), the Zuul log shows errors 500 at that time. The last event received by Zuul was on 2013-04-05 04:26:56 (UTC) then the JSON events cant be decoded / are not received: 2013-04-05 04:29:46,092 ERROR gerrit.GerritWatcher: Exception on ssh event stream: Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 68, in _run self._listen(stdout, stderr) File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 52, in _listen self._read(stdout) File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 39, in _read data = json.loads(l) File "/usr/lib/python2.7/json/__init__.py", line 326, in loads return _default_decoder.decode(s) File "/usr/lib/python2.7/json/decoder.py", line 366, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode raise ValueError("No JSON object could be decoded") ValueError: No JSON object could be decoded
Restarted Zuul just in case: nothing. Definitely on Gerrit side.
Ariel restarted Gerrit on manganese. That unbroken the stream-events thus cleaning the issue. I am leaving this bug open for monitoring / investigation. There must be some regression in Gerrit since we upgraded it yesterday.
(In reply to comment #4) > Ariel restarted Gerrit on manganese. That unbroken the stream-events thus > cleaning the issue. Thanks. Lowering priority.
Restarted again, since stram-events hung. The output from the queue is kind of weird & concerning: $ ssh gerrit.wikimedia.org gerrit show-queue -w Task State Command ------------------------------------------------------------------------------ 9c0dc448 com.google.gerrit.sshd.commands.StreamEvents$3@6435207f 85d6cd95 com.google.gerrit.sshd.commands.StreamEvents$3@5be8a53e c54125b8 com.google.gerrit.sshd.commands.StreamEvents$3@6b8b896 a0d428db com.google.gerrit.sshd.commands.StreamEvents$3@495e834c 0323c3c0 com.google.gerrit.sshd.commands.StreamEvents$3@5bfe1835 49cc90ac com.google.gerrit.sshd.commands.StreamEvents$3@6e76ec2 e9bf84a4 com.google.gerrit.sshd.commands.StreamEvents$3@44188618 4617a0ff com.google.gerrit.sshd.commands.StreamEvents$3@260bb495 5e8f7686 com.google.gerrit.sshd.commands.StreamEvents$3@4a7cb5b6 fe9f4ab9 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@40811f5d 3e416208 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7194d12c 1e449e17 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5d3233c0 de9a86cc waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7efe402f 9ea40e12 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@ae3b20d bea9d2e8 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@6abe7910 be3472ba waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@6130d9f5 5e4a1628 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4e3595ac 1915a152 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@417de5ee 3943c55a waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1ff57751 9938b1f3 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5a31e6f7 f911ad68 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2329064 593eb9dd waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@f04a851 9982f1fc waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@30828dd9 665afa9e waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2c6c773b 066e9e41 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@507d946 23f714d9 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4b2eba70 0ff474d2 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@68d2359f 6fc0906b waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@40be8f7c 835cc5a8 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1ebe3719 63619163 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5a135a13 c3563d8b waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2c7ec398 a35b09b3 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@365b46fa 0370d533 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2290e2f7 e36ca160 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5624d7db 436a4d46 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@cd7970c 23671980 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1dfd9b6a 63fd11f8 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@74e65b87 83fda551 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@3f833806 5d722e74 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@3fd3327d bd77aa62 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4297cd47 9d7c2648 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7c261278 bdab2a74 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2786c3f5 baa810b1 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@ab1e19 7f44cb17 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@6244fba6 bf4ec337 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@194923d6 df493731 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@10cfc392 ff2f5b61 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4649853e 1f334fd3 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4cacd7f 7f160b2c waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1ee581b1 046d445a waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@6da59b4f 64b8c066 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@137797c6 848db42c waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@108d5226 4926f573 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2e0c4547 8bfdc551 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1e794f4e 2b8db900 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2e459983 6b02d146 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@27984fef cb073d54 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5b617f30 ab0c4979 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5de5572e eb0d6177 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@45a5965c 0b11d594 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7a3295bc 783069a2 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@a7e838 46810952 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@f7eb4de 267e5542 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7bbebd2a 868b8174 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@27754fde c694199b waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@6ce8c0f 6688cd79 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@624d3046 c6e3b962 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@479462c4 06f1d196 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5ee05692 554535c0 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@3830aedd b235a762 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4284bfd f23f9f40 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@399c7615 923ae350 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4f0c7a3f d244dbcb waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7ac6658a 3249b7e2 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@773485fe 7253af98 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@3c2de6a4 1246f3d4 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@698137d9 5250eb8b waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5a53bbe7 5288eb78 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2f556905 cc3d20e4 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1061df8a 2c32bc16 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2279bdac 0c2fb8ac waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@669378e2 0c5a59a0 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@66280ecb 6c4e75e4 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4fe365d2 4c4b71d4 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@99ae56d f650f82e waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@184c5c08 e71cc87a waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@3d2e1dae d36dcb8e 12:18:22.186 push gerritslave@formey.wikimedia.org:/var/lib/gerrit2/review_site/git/mediawiki/extensions/FundraisingEmailUnsubscribe.git da7faee1 19:42:45.214 Log File Compressor ------------------------------------------------------------------------------ 88 tasks
The Zuul log is full of the following: 2013-04-06 17:02:55,441 ERROR gerrit.GerritWatcher: Exception on ssh event stream: Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 68, in _run self._listen(stdout, stderr) File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 52, in _listen self._read(stdout) File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 39, in _read data = json.loads(l) File "/usr/lib/python2.7/json/__init__.py", line 326, in loads return _default_decoder.decode(s) File "/usr/lib/python2.7/json/decoder.py", line 366, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode raise ValueError("No JSON object could be decoded") ValueError: No JSON object could be decoded
Changing back to git/gerrit. Zuul has no issue, it is Gerrit holding the events and not emitting them.
Stuck again but this time show-queue gives me: Task State Command ------------------------------------------------------------------------------ 5229a073 git-upload-pack '/mediawiki/extensions/VisualEditor' (shirayuki) aad42f27 07:18:57.201 inside-wmf/mediawiki/extensions/FundraisingEmailUnsubscribe ------------------------------------------------------------------------------ Logged a message on IRC about it and pinged ops list.
Although I cannot reliably reproduce the problem, the culprit seems to be gerrit's version bump of Apache SSHD to 0.6.0, which comes with closing of idle ssh connections. Upon 10 minutes of inactivity, gerrit disconnects Zuul, and Zuul has to reconnect. Gerrit accepts the connection, but sometimes refuses to stream events over the new connection. There seems to be some race condition in gerrit when Zuul reconnects. We can mitigate the problem by having at least 1 commit every 10 minutes... Go, programmers go :-)) To also have a solution until we get that commit density 24/7, https://gerrit-review.googlesource.com/#/c/44351/ got submitted to upstream gerrit. (We cannot simply downgrade gerrit to the last known good version, as that would require database schema downgrading, which currently is not supported by gerrit)
Raising importance, this needs fixing ASAP. We need testing. Not just to save work, but because we (and by design and intention) rely on it automatically happen and performing the merge action on approval. Also the lack of postmerge events is blocking (auto-deploy for certain projects).
(In reply to comment #10) > https://gerrit-review.googlesource.com/#/c/44351/ > got submitted to upstream gerrit. > It hasn't been reviewed / merged yet however. How long are we willing to wait? Can we apply it to our fork for now? Jenkins being down is causing lots of workflow problems throughout the different projects.
For the context: Zuul is a python daemon running on gallium. It establish a ssh connection as jenkins-bot on manganese.wikimedia.org and runs the 'gerrit stream-events' command. It expect JSON object to be returned. Looking at the Zuul log (fgrep GerritWatcher /var/log/zuul/zuul.log), it does emit a stacktrace every 10 minutes: 2013-04-08 05:50:48,028 ERROR gerrit.GerritWatcher: Exception on ssh event stream: 2013-04-08 06:00:53,176 ERROR gerrit.GerritWatcher: Exception on ssh event stream: 2013-04-08 06:10:58,266 ERROR gerrit.GerritWatcher: Exception on ssh event stream: 2013-04-08 06:21:03,383 ERROR gerrit.GerritWatcher: Exception on ssh event stream: 2013-04-08 06:31:08,472 ERROR gerrit.GerritWatcher: Exception on ssh event stream: 2013-04-08 06:41:13,594 ERROR gerrit.GerritWatcher: Exception on ssh event stream: 2013-04-08 06:51:18,687 ERROR gerrit.GerritWatcher: Exception on ssh event stream: 2013-04-08 07:01:23,782 ERROR gerrit.GerritWatcher: Exception on ssh event stream: A full trace is: 2013-04-08 00:12:10,016 ERROR gerrit.GerritWatcher: Exception on ssh event stream: Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 68, in _run self._listen(stdout, stderr) File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 52, in _listen self._read(stdout) File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 39, in _read data = json.loads(l) File "/usr/lib/python2.7/json/__init__.py", line 326, in loads return _default_decoder.decode(s) File "/usr/lib/python2.7/json/decoder.py", line 366, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode raise ValueError("No JSON object could be decoded") ValueError: No JSON object could be decoded Which is stream-events sending some invalid JSON, most probably something like "Connection timed out after 10 minutes". It seems Zuul properly reconnect the ssh connection. Whenever the stream-events for Zuul, it is also blocked for any other user even on a fresh new connection. For example, I could run the stream-events command using my own account, fill a comment in Gerrit and I get no event on the channel.
Related URL: https://gerrit.wikimedia.org/r/58059 (Gerrit Change Ie0b3b0a5d77a2192abd09a934e780b8e622a1edd)
Related URL: https://gerrit.wikimedia.org/r/58060 (Gerrit Change Ie0b3b0a5d77a2192abd09a934e780b8e622a1edd)
Related URL: https://gerrit.wikimedia.org/r/58063 (Gerrit Change I431ff970596aae03b3ab5276303ed18574892b13)
Until we can deploy a new gerrit build, we prevent timeouts by forcing a recheck on https://gerrit.wikimedia.org/r/#/c/58060/ every 4 minutes.
(In reply to comment #12) > (In reply to comment #10) > > https://gerrit-review.googlesource.com/#/c/44351/ > > got submitted to upstream gerrit. > > > > It hasn't been reviewed / merged yet however. How long are we willing to > wait? > Can we apply it to our fork for now? > > Jenkins being down is causing lots of workflow problems throughout the > different projects. We don't have a fork, we use vanilla Gerrit :) But yes, I'm looking into deploying this now rather than waiting on upstream to merge.
Deployed the upstream patch just now.
*** Bug 47004 has been marked as a duplicate of this bug. ***
Lowering priority and severity since we have a workaround now.
(In reply to comment #21) > Lowering priority and severity since we have a workaround now. Now actually doing so for priority too.
*** Bug 49330 has been marked as a duplicate of this bug. ***
We previously rolled back to a gerrit that exposed the problem again :-( In case we ever do this again: To keep the zuul<->gerrit connection alive, it suffices to have at least one event fed to zuul each 10 minutes. So for example clone our gerrit-ping repo. Restore https://gerrit.wikimedia.org/r/#/c/58060/ Fetch patch set 1. And run while true ; do date ; ssh gerrit.wikimedia.org gerrit review -m "'recheck'" --verified 0 75be73703978e75a25d5e3bb5685c51580af3060 ; sleep 240 ; done in a shell (I had ssh configured to automatically supply username and gerrit port). Gerrit 2.6-rc0-154-gfcdb34b is the first version that we deployed and contained the fix.
(In reply to comment #24) > We previously rolled back to a gerrit that exposed the problem again :-( > To expand on this, last week when we tried upgrading to 2.7-rc1-424-gef469ac we hit a bunch of problems so we rolled back. However, I rolled too far back to 2.6-rc0-144-gb1dadd2.war instead of 2.6-rc0-154-gfcdb34b which lacked the SSH timeout fix mentioned above.