Last modified: 2013-06-10 15:03:00 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T48917, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 46917 - Zuul: Gerrit's ssh event stream unavailable
Zuul: Gerrit's ssh event stream unavailable
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Git/Gerrit (Other open bugs)
wmf-deployment
All All
: Normal normal (vote)
: ---
Assigned To: Nobody - You can work on this!
:
: 49330 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-04-05 08:58 UTC by Antoine "hashar" Musso (WMF)
Modified: 2013-06-10 15:03 UTC (History)
13 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Antoine "hashar" Musso (WMF) 2013-04-05 08:58:04 UTC
Since at 2013-04-05 04:29:46 UTC, Zuul no more receives events from Gerrit. It uses a CLI command such as:

 ssh -p 29418  jenkins-bot@manganese.wikimedia.org gerrit stream-events

I did stream-events using both the jenkins-bot user and my own account 'hashar'. While submitting a new patchset or posting a comment: nothing is reported.

The end effect is that no jobs are triggered in Jenkins.
Comment 1 Siebrand Mazeland 2013-04-05 09:14:49 UTC
This is a major regression in our development workflow.
Comment 2 Antoine "hashar" Musso (WMF) 2013-04-05 09:32:28 UTC
Gerrit has been upgraded with https://gerrit.wikimedia.org/r/#/c/57508/
Which got merged on Apr 4, 2013 20:29 (UTC).

Apparently Gerrit got restarted at 20:44 (UTC), the Zuul log shows errors 500 at that time.

The last event received by Zuul was on 2013-04-05 04:26:56 (UTC) then the JSON events cant be decoded / are not received:

2013-04-05 04:29:46,092 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 68, in _run
    self._listen(stdout, stderr)
  File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 52, in _listen
    self._read(stdout)
  File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 39, in _read
    data = json.loads(l)
  File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Comment 3 Antoine "hashar" Musso (WMF) 2013-04-05 09:33:24 UTC
Restarted Zuul just in case: nothing. Definitely on Gerrit side.
Comment 4 Antoine "hashar" Musso (WMF) 2013-04-05 09:40:19 UTC
Ariel restarted Gerrit on manganese. That unbroken the stream-events thus cleaning the issue.

I am leaving this bug open for monitoring / investigation.  There must be some regression in Gerrit since we upgraded it yesterday.
Comment 5 Andre Klapper 2013-04-05 10:16:31 UTC
(In reply to comment #4)
> Ariel restarted Gerrit on manganese. That unbroken the stream-events thus
> cleaning the issue.

Thanks. Lowering priority.
Comment 6 Chad H. 2013-04-06 12:18:35 UTC
Restarted again, since stram-events hung. The output from the queue is kind of weird & concerning:

$ ssh gerrit.wikimedia.org gerrit show-queue -w
Task     State                 Command
------------------------------------------------------------------------------
9c0dc448                       com.google.gerrit.sshd.commands.StreamEvents$3@6435207f
85d6cd95                       com.google.gerrit.sshd.commands.StreamEvents$3@5be8a53e
c54125b8                       com.google.gerrit.sshd.commands.StreamEvents$3@6b8b896
a0d428db                       com.google.gerrit.sshd.commands.StreamEvents$3@495e834c
0323c3c0                       com.google.gerrit.sshd.commands.StreamEvents$3@5bfe1835
49cc90ac                       com.google.gerrit.sshd.commands.StreamEvents$3@6e76ec2
e9bf84a4                       com.google.gerrit.sshd.commands.StreamEvents$3@44188618
4617a0ff                       com.google.gerrit.sshd.commands.StreamEvents$3@260bb495
5e8f7686                       com.google.gerrit.sshd.commands.StreamEvents$3@4a7cb5b6
fe9f4ab9 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@40811f5d
3e416208 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@7194d12c
1e449e17 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@5d3233c0
de9a86cc waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@7efe402f
9ea40e12 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@ae3b20d
bea9d2e8 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@6abe7910
be3472ba waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@6130d9f5
5e4a1628 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@4e3595ac
1915a152 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@417de5ee
3943c55a waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@1ff57751
9938b1f3 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@5a31e6f7
f911ad68 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@2329064
593eb9dd waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@f04a851
9982f1fc waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@30828dd9
665afa9e waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@2c6c773b
066e9e41 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@507d946
23f714d9 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@4b2eba70
0ff474d2 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@68d2359f
6fc0906b waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@40be8f7c
835cc5a8 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@1ebe3719
63619163 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@5a135a13
c3563d8b waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@2c7ec398
a35b09b3 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@365b46fa
0370d533 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@2290e2f7
e36ca160 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@5624d7db
436a4d46 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@cd7970c
23671980 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@1dfd9b6a
63fd11f8 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@74e65b87
83fda551 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@3f833806
5d722e74 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@3fd3327d
bd77aa62 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@4297cd47
9d7c2648 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@7c261278
bdab2a74 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@2786c3f5
baa810b1 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@ab1e19
7f44cb17 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@6244fba6
bf4ec337 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@194923d6
df493731 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@10cfc392
ff2f5b61 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@4649853e
1f334fd3 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@4cacd7f
7f160b2c waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@1ee581b1
046d445a waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@6da59b4f
64b8c066 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@137797c6
848db42c waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@108d5226
4926f573 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@2e0c4547
8bfdc551 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@1e794f4e
2b8db900 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@2e459983
6b02d146 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@27984fef
cb073d54 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@5b617f30
ab0c4979 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@5de5572e
eb0d6177 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@45a5965c
0b11d594 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@7a3295bc
783069a2 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@a7e838
46810952 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@f7eb4de
267e5542 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@7bbebd2a
868b8174 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@27754fde
c694199b waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@6ce8c0f
6688cd79 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@624d3046
c6e3b962 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@479462c4
06f1d196 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@5ee05692
554535c0 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@3830aedd
b235a762 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@4284bfd
f23f9f40 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@399c7615
923ae350 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@4f0c7a3f
d244dbcb waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@7ac6658a
3249b7e2 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@773485fe
7253af98 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@3c2de6a4
1246f3d4 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@698137d9
5250eb8b waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@5a53bbe7
5288eb78 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@2f556905
cc3d20e4 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@1061df8a
2c32bc16 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@2279bdac
0c2fb8ac waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@669378e2
0c5a59a0 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@66280ecb
6c4e75e4 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@4fe365d2
4c4b71d4 waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@99ae56d
f650f82e waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@184c5c08
e71cc87a waiting ....          com.google.gerrit.sshd.commands.StreamEvents$3@3d2e1dae
d36dcb8e 12:18:22.186          push gerritslave@formey.wikimedia.org:/var/lib/gerrit2/review_site/git/mediawiki/extensions/FundraisingEmailUnsubscribe.git
da7faee1 19:42:45.214          Log File Compressor
------------------------------------------------------------------------------
  88 tasks
Comment 7 Krinkle 2013-04-06 17:08:55 UTC
The Zuul log is full of the following:

2013-04-06 17:02:55,441 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 68, in _run
    self._listen(stdout, stderr)
  File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 52, in _listen
    self._read(stdout)
  File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 39, in _read
    data = json.loads(l)
  File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Comment 8 Antoine "hashar" Musso (WMF) 2013-04-07 07:13:58 UTC
Changing back to git/gerrit. Zuul has no issue, it is Gerrit holding the events and not emitting them.
Comment 9 Antoine "hashar" Musso (WMF) 2013-04-07 07:29:02 UTC
Stuck again but this time show-queue gives me:

Task     State                 Command
------------------------------------------------------------------------------
5229a073                       git-upload-pack '/mediawiki/extensions/VisualEditor' (shirayuki)
aad42f27 07:18:57.201          inside-wmf/mediawiki/extensions/FundraisingEmailUnsubscribe
------------------------------------------------------------------------------

Logged a message on IRC about it and pinged ops list.
Comment 10 christian 2013-04-08 00:12:14 UTC
Although I cannot reliably reproduce the problem, the culprit seems to
be gerrit's version bump of Apache SSHD to 0.6.0, which comes with
closing of idle ssh connections.

Upon 10 minutes of inactivity, gerrit disconnects Zuul, and Zuul has
to reconnect. Gerrit accepts the connection, but sometimes refuses to
stream events over the new connection. There seems to be some race
condition in gerrit when Zuul reconnects.

We can mitigate the problem by having at least 1 commit every 10
minutes... Go, programmers go :-))

To also have a solution until we get that commit density 24/7,
https://gerrit-review.googlesource.com/#/c/44351/
got submitted to upstream gerrit.

(We cannot simply downgrade gerrit to the last known good version, as
that would require database schema downgrading, which currently is not
supported by gerrit)
Comment 11 Krinkle 2013-04-08 03:20:28 UTC
Raising importance, this needs fixing ASAP. We need testing. Not just to save work, but because we (and by design and intention) rely on it automatically happen and performing the merge action on approval.

Also the lack of postmerge events is blocking (auto-deploy for certain projects).
Comment 12 Krinkle 2013-04-08 05:05:26 UTC
(In reply to comment #10)
> https://gerrit-review.googlesource.com/#/c/44351/
> got submitted to upstream gerrit.
> 

It hasn't been reviewed / merged yet however. How long are we willing to wait? Can we apply it to our fork for now?

Jenkins being down is causing lots of workflow problems throughout the different projects.
Comment 13 Antoine "hashar" Musso (WMF) 2013-04-08 07:54:52 UTC
For the context:

Zuul is a python daemon running on gallium. It establish a ssh connection as jenkins-bot on manganese.wikimedia.org and runs the 'gerrit stream-events' command.   It expect JSON object to be returned.

Looking at the Zuul log (fgrep GerritWatcher /var/log/zuul/zuul.log), it does emit a stacktrace every 10 minutes:


2013-04-08 05:50:48,028 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:00:53,176 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:10:58,266 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:21:03,383 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:31:08,472 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:41:13,594 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:51:18,687 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 07:01:23,782 ERROR gerrit.GerritWatcher: Exception on ssh event stream:

A full trace is:


2013-04-08 00:12:10,016 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 68, in _run
    self._listen(stdout, stderr)
  File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 52, in _listen
    self._read(stdout)
  File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 39, in _read
    data = json.loads(l)
  File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded


Which is stream-events sending some invalid JSON, most probably something like "Connection timed out after 10 minutes".    It seems Zuul properly reconnect the ssh connection.


Whenever the stream-events for Zuul, it is also blocked for any other user even on a fresh new connection.  For example, I could run the stream-events command using my own account, fill a comment in Gerrit and I get no event on the channel.
Comment 14 Gerrit Notification Bot 2013-04-08 08:09:13 UTC
Related URL: https://gerrit.wikimedia.org/r/58059 (Gerrit Change Ie0b3b0a5d77a2192abd09a934e780b8e622a1edd)
Comment 15 Gerrit Notification Bot 2013-04-08 08:41:27 UTC
Related URL: https://gerrit.wikimedia.org/r/58060 (Gerrit Change Ie0b3b0a5d77a2192abd09a934e780b8e622a1edd)
Comment 16 Gerrit Notification Bot 2013-04-08 08:51:21 UTC
Related URL: https://gerrit.wikimedia.org/r/58063 (Gerrit Change I431ff970596aae03b3ab5276303ed18574892b13)
Comment 17 christian 2013-04-08 09:23:59 UTC
Until we can deploy a new gerrit build, we prevent timeouts by forcing a
recheck on
https://gerrit.wikimedia.org/r/#/c/58060/
every 4 minutes.
Comment 18 Chad H. 2013-04-08 10:34:56 UTC
(In reply to comment #12)
> (In reply to comment #10)
> > https://gerrit-review.googlesource.com/#/c/44351/
> > got submitted to upstream gerrit.
> > 
> 
> It hasn't been reviewed / merged yet however. How long are we willing to
> wait?
> Can we apply it to our fork for now?
> 
> Jenkins being down is causing lots of workflow problems throughout the
> different projects.

We don't have a fork, we use vanilla Gerrit :)
But yes, I'm looking into deploying this now rather than waiting on upstream to merge.
Comment 19 Chad H. 2013-04-08 12:04:45 UTC
Deployed the upstream patch just now.
Comment 20 Chad H. 2013-04-08 19:38:22 UTC
*** Bug 47004 has been marked as a duplicate of this bug. ***
Comment 21 Antoine "hashar" Musso (WMF) 2013-04-09 08:07:39 UTC
Lowering priority and severity since we have a workaround now.
Comment 22 Andre Klapper 2013-05-14 10:02:55 UTC
(In reply to comment #21)
> Lowering priority and severity since we have a workaround now.

Now actually doing so for priority too.
Comment 23 Antoine "hashar" Musso (WMF) 2013-06-10 13:11:24 UTC
*** Bug 49330 has been marked as a duplicate of this bug. ***
Comment 24 christian 2013-06-10 14:15:49 UTC
We previously rolled back to a gerrit that exposed the problem again :-(

In case we ever do this again: To keep the zuul<->gerrit connection
alive, it suffices to have at least one event fed to zuul each 10
minutes.

So for example clone our gerrit-ping repo.
Restore
  https://gerrit.wikimedia.org/r/#/c/58060/
Fetch patch set 1.
And run

  while true ; do date ; ssh gerrit.wikimedia.org gerrit review -m "'recheck'" --verified 0 75be73703978e75a25d5e3bb5685c51580af3060 ; sleep 240 ; done

in a shell (I had ssh configured to automatically supply username and gerrit port).

Gerrit 2.6-rc0-154-gfcdb34b is the first version that we deployed and
contained the fix.
Comment 25 Chad H. 2013-06-10 14:18:21 UTC
(In reply to comment #24)
> We previously rolled back to a gerrit that exposed the problem again :-(
> 

To expand on this, last week when we tried upgrading to 2.7-rc1-424-gef469ac we hit a bunch of problems so we rolled back. However, I rolled too far back to 2.6-rc0-144-gb1dadd2.war instead of 2.6-rc0-154-gfcdb34b which lacked the SSH timeout fix mentioned above.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links