Last modified: 2014-09-23 08:59:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T63882, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 61882 - [Trebuchet] Salt times out on parsoid restarts


Summary:	[Trebuchet] Salt times out on parsoid restarts

Status:	NEW

Product:	Wikimedia
Classification:	Unclassified
Component:	Deployment systems (Other open bugs)
Version:	wmf-deployment
Hardware:	All All

Importance:	High normal (vote)
Target Milestone:	---
Assigned To:	Ryan Lane

URL:
Whiteboard:
Keywords:	ops

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2014-02-24 22:06 UTC by Gabriel Wicke
Modified:	2014-09-23 08:59 UTC (History)
CC List:	6 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Gabriel Wicke 2014-02-24 22:06:08 UTC

[13:29] <gwicke> we got an early timeout from salt during the parsoid deploy, which seems to be a salt misconfiguration
[13:31] <gwicke> I see     cmd = ("sudo salt-call -l quiet --out json publish.runner "
[13:31] <gwicke>            "deploy.restart '{0}','{1}'")
[13:32] <gwicke> in /usr/local/bin/service-restart
[13:32] <gwicke> on tin
[13:34] <gwicke> from http://docs.saltstack.com/ref/cli/salt.html: -t TIMEOUT, default 5
[13:35] <gwicke> I think I got the timeout around 5 seconds
[13:36] <gwicke> parsoid needs up to about 90 seconds for a graceful restart

Comment 1 Gabriel Wicke 2014-02-28 18:26:16 UTC

From Ryan's main:
TL;DR:

Ensure batching doesn't occur by setting the batch value to 100%:

service-restart --repo 'parsoid/deploy' --batch='100%'

Long version:

I spent some time tracking this down tonight. From the master the following commands work reliably:

1: salt -b '10%' -G 'deployment_target:parsoid' service.restart parsoid
2: salt -b '10%' -G 'deployment_target:parsoid' deploy.restart 'parsoid/deploy'
3: salt-run deploy.restart 'parsoid/deploy' '10%'

#2 is a wrapper function for service.restart that maps a repo to a service, then restarts it (it has other future uses that make it necessary as well). #3 is a runner that can be easily referenced from peers like tin to make handling security easier, it basically just calls #2.

From tin we call:

sudo salt-call -l quiet --out json publish.runner deploy.restart 'parsoid/deploy','10%'

This calls #3 on the master via salt's publication system.

The problem here is that publish.runner has a default timeout. When that timeout is reached, the runner stops executing. Part of the issue is that it times out at all. The other part of this issue is that this command has no timeout argument: <http://docs.saltstack.com/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner>.

Since the command is being run as a batch, it splits the list of minions up into chunks and calls salt deploy.restart on each of them. If the sum of the time of all restarts is greater than the timeout of publish.runner any minions that didn't get called never do. Since we currently have a small number of minions this tends to often result in a single minion being left out. If we increased the number of minions, this would result in many more being left out.

I've opened a bug for this:

https://github.com/saltstack/salt/issues/10814

I've also added a pull request to allow a timeout argument for publish.runner:

https://github.com/saltstack/salt/pull/10815

In the meantime it's possible to workaround this issue by simply ensuring batching doesn't occur by setting the batch value to 100%:

service-restart --repo 'parsoid/deploy' --batch='100%'

I'm assuming that since parsoid is doing graceful restarts that this shouldn't be a problem.

Comment 2 Gabriel Wicke 2014-02-28 18:26:45 UTC

My reply:

Without batching this is taking down the Parsoid cluster for at least a few seconds:

a) Parsoid stops accepting new requests while restarting. This typically
takes a few seconds, but can take up to 60 seconds if all workers are busy.

b) After the restart the first request per worker will be slightly
slower as the JIT is doing its thing.

With Varnish backend retries some of this might not result in
client-visible failures, although I am not sure that Varnish retries
non-idempotent POSTs. That would mean that page saves are failing.

I think I'll rather (ask a root to) do a

  dsh -g parsoid service parsoid restart

instead until the salt bug is fixed.

Comment 3 Ryan Lane 2014-02-28 18:39:17 UTC

A root could also run this on the salt master:

  salt-run deploy.restart 'parsoid/deploy' '10%'

Comment 4 Gabriel Wicke 2014-02-28 18:42:00 UTC

(In reply to Ryan Lane from comment #3)
> A root could also run this on the salt master:
> 
>   salt-run deploy.restart 'parsoid/deploy' '10%'

Will that avoid the timeout?

Comment 5 Ryan Lane 2014-02-28 18:50:33 UTC

Yes. The bug is calling the runner from a peer. If it's called on the master directly it works as expected.

Comment 6 Ryan Lane 2014-02-28 18:51:20 UTC

Any of the three commands listed in comment 1 will work if run from the master.

Comment 7 Daniel Zahn 2014-03-04 16:31:49 UTC

(In reply to Ryan Lane from comment #3)
> A root could also run this on the salt master:
> 
>   salt-run deploy.restart 'parsoid/deploy' '10%'

to confirm, I ran that yesterday as root, got a "completed" after a little while and per Gabriel the result was good.

 21:07 mutante: restarting parsoid with salt-run deploy.restart 'parsoid/deploy' '10%'

Comment 8 Antoine "hashar" Musso (WMF) 2014-09-23 08:59:59 UTC

Ryan Lane wrote:
> The problem here is that publish.runner has a default timeout. When that
> timeout is reached, the runner stops executing. Part of the issue is that it
> times out at all. The other part of this issue is that this command has no
> timeout argument:
> <http://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner>.
<snip>
> I've also added a pull request to allow a timeout argument for
> publish.runner:
> 
>   https://github.com/saltstack/salt/pull/10815

Ryan patch to add a timeout to salt.modules.publish.runner has been merged (as commit 72dfd80e in https://github.com/saltstack/salt/ ).  It is included in:

 $ git tag --contains 72dfd80e
 v2014.7
 v2014.7.0rc1
 v2014.7.0rc2
 $

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links