Last modified: 2014-09-23 08:59:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T63882, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 61882 - [Trebuchet] Salt times out on parsoid restarts
[Trebuchet] Salt times out on parsoid restarts
Status: NEW
Product: Wikimedia
Classification: Unclassified
Deployment systems (Other open bugs)
wmf-deployment
All All
: High normal (vote)
: ---
Assigned To: Ryan Lane
: ops
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-02-24 22:06 UTC by Gabriel Wicke
Modified: 2014-09-23 08:59 UTC (History)
6 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Gabriel Wicke 2014-02-24 22:06:08 UTC
[13:29] <gwicke> we got an early timeout from salt during the parsoid deploy, which seems to be a salt misconfiguration
[13:31] <gwicke> I see     cmd = ("sudo salt-call -l quiet --out json publish.runner "
[13:31] <gwicke>            "deploy.restart '{0}','{1}'")
[13:32] <gwicke> in /usr/local/bin/service-restart
[13:32] <gwicke> on tin
[13:34] <gwicke> from http://docs.saltstack.com/ref/cli/salt.html: -t TIMEOUT, default 5
[13:35] <gwicke> I think I got the timeout around 5 seconds
[13:36] <gwicke> parsoid needs up to about 90 seconds for a graceful restart
Comment 1 Gabriel Wicke 2014-02-28 18:26:16 UTC
From Ryan's main:
TL;DR:

Ensure batching doesn't occur by setting the batch value to 100%:

  service-restart --repo 'parsoid/deploy' --batch='100%'

Long version:

I spent some time tracking this down tonight. From the master the following commands work reliably:

  1: salt -b '10%' -G 'deployment_target:parsoid' service.restart parsoid
  2: salt -b '10%' -G 'deployment_target:parsoid' deploy.restart 'parsoid/deploy'
  3: salt-run deploy.restart 'parsoid/deploy' '10%'

#2 is a wrapper function for service.restart that maps a repo to a service, then restarts it (it has other future uses that make it necessary as well). #3 is a runner that can be easily referenced from peers like tin to make handling security easier, it basically just calls #2.

From tin we call:

  sudo salt-call -l quiet --out json publish.runner deploy.restart 'parsoid/deploy','10%'

This calls #3 on the master via salt's publication system.

The problem here is that publish.runner has a default timeout. When that timeout is reached, the runner stops executing. Part of the issue is that it times out at all. The other part of this issue is that this command has no timeout argument: <http://docs.saltstack.com/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner>.

Since the command is being run as a batch, it splits the list of minions up into chunks and calls salt deploy.restart on each of them. If the sum of the time of all restarts is greater than the timeout of publish.runner any minions that didn't get called never do. Since we currently have a small number of minions this tends to often result in a single minion being left out. If we increased the number of minions, this would result in many more being left out.

I've opened a bug for this:

  https://github.com/saltstack/salt/issues/10814

I've also added a pull request to allow a timeout argument for publish.runner:

  https://github.com/saltstack/salt/pull/10815

In the meantime it's possible to workaround this issue by simply ensuring batching doesn't occur by setting the batch value to 100%:

  service-restart --repo 'parsoid/deploy' --batch='100%'

I'm assuming that since parsoid is doing graceful restarts that this shouldn't be a problem.
Comment 2 Gabriel Wicke 2014-02-28 18:26:45 UTC
My reply:

Without batching this is taking down the Parsoid cluster for at least a few seconds:

a) Parsoid stops accepting new requests while restarting. This typically
takes a few seconds, but can take up to 60 seconds if all workers are busy.

b) After the restart the first request per worker will be slightly
slower as the JIT is doing its thing.

With Varnish backend retries some of this might not result in
client-visible failures, although I am not sure that Varnish retries
non-idempotent POSTs. That would mean that page saves are failing.

I think I'll rather (ask a root to) do a

  dsh -g parsoid service parsoid restart

instead until the salt bug is fixed.
Comment 3 Ryan Lane 2014-02-28 18:39:17 UTC
A root could also run this on the salt master:

  salt-run deploy.restart 'parsoid/deploy' '10%'
Comment 4 Gabriel Wicke 2014-02-28 18:42:00 UTC
(In reply to Ryan Lane from comment #3)
> A root could also run this on the salt master:
> 
>   salt-run deploy.restart 'parsoid/deploy' '10%'

Will that avoid the timeout?
Comment 5 Ryan Lane 2014-02-28 18:50:33 UTC
Yes. The bug is calling the runner from a peer. If it's called on the master directly it works as expected.
Comment 6 Ryan Lane 2014-02-28 18:51:20 UTC
Any of the three commands listed in comment 1 will work if run from the master.
Comment 7 Daniel Zahn 2014-03-04 16:31:49 UTC
(In reply to Ryan Lane from comment #3)
> A root could also run this on the salt master:
> 
>   salt-run deploy.restart 'parsoid/deploy' '10%'

to confirm, I ran that yesterday as root, got a "completed" after a little while and per Gabriel the result was good.

 21:07 mutante: restarting parsoid with salt-run deploy.restart 'parsoid/deploy' '10%'
Comment 8 Antoine "hashar" Musso (WMF) 2014-09-23 08:59:59 UTC
Ryan Lane wrote:
> The problem here is that publish.runner has a default timeout. When that
> timeout is reached, the runner stops executing. Part of the issue is that it
> times out at all. The other part of this issue is that this command has no
> timeout argument:
> <http://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner>.
<snip>
> I've also added a pull request to allow a timeout argument for
> publish.runner:
> 
>   https://github.com/saltstack/salt/pull/10815

Ryan patch to add a timeout to salt.modules.publish.runner has been merged (as commit 72dfd80e in https://github.com/saltstack/salt/ ).  It is included in:

 $ git tag --contains 72dfd80e
 v2014.7
 v2014.7.0rc1
 v2014.7.0rc2
 $

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links