Last modified: 2014-09-23 23:58:03 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T68050, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 66050 - App servers get into bad states when coming back online/are newly provisioned due to puppet/salt craziness
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: Deployment systems
Version: wmf-deployment
Hardware: All All
Importance: Normal major
Target Milestone: ---
Assigned To: Nobody
Depends on:
Blocks:
 
Reported: 2014-06-02 21:24 UTC by Greg Grossmeier
Modified: 2014-09-23 23:58 UTC (History)
CC List: 5 users

Comment 1 Bryan Davis 2014-06-02 23:55:24 UTC
The puppet configuration that attempts to ensure that each apache server has the latest version of the mediawiki code and configuration is in ::mediawiki::sync. Specifically, Exec['mw-sync'] and Exec['mw-sync-rebuild-cdbs'] combine to perform the end-host scap steps of syncing with the state of the rsync server on tin. The Exec['mw-sync'] definition is marked as `refreshonly => true`, which means it will only run when another resource explicitly triggers it.
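
For readers unfamiliar with `refreshonly`, a minimal sketch of what such a definition might look like follows. The `command` value is an assumption for illustration (it is not copied from the real ::mediawiki::sync manifest); only the resource title and the `refreshonly` setting come from the description above.

```puppet
# Illustrative sketch only, not the actual ::mediawiki::sync code.
exec { 'mw-sync':
  # Assumed command; the real manifest may invoke something different.
  command     => '/usr/local/bin/sync-common',
  # With refreshonly, the command is skipped on every run unless another
  # resource sends this exec a refresh event (via notify or subscribe).
  refreshonly => true,
}
```

The consequence is that a host where nothing ever sends the refresh event will never run the sync, no matter how many times puppet applies the catalog.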

The explicit ask comes from ::mediawiki::web, where Exec['apache-trigger-mw-sync'] is defined. This exec checks whether any apache2 processes are running. If none are found, it notifies Exec['mw-sync']. The Service['apache'] definition subscribes to Exec['mw-sync'] so that apache is started after Exec['mw-sync'] has completed.
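
The trigger chain described above could be sketched roughly as follows. The `command`, the `pgrep` check, and the `ensure` value are assumptions chosen to illustrate the pattern; the real ::mediawiki::web manifest may implement the check differently.

```puppet
# Illustrative sketch only, not the actual ::mediawiki::web code.
exec { 'apache-trigger-mw-sync':
  # Assumed no-op command; its only job is to emit a refresh event.
  command => '/bin/true',
  # pgrep exits 0 when a matching process exists, so this exec is skipped
  # (and sends no notification) whenever apache2 is already running.
  unless  => '/usr/bin/pgrep apache2',
  notify  => Exec['mw-sync'],
}

service { 'apache':
  ensure    => running,
  # subscribe orders the service after the exec and refreshes the service
  # when the exec runs.
  subscribe => Exec['mw-sync'],
}
```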

There is at least one possible race condition in ::mediawiki::sync. Exec['mw-sync'] requires File['/usr/local/bin/sync-common'], but File['/usr/local/bin/sync-common'] is a symlink to /srv/deployment/scap/scap/bin/sync-common, and that file is realized by Deployment::Target['scap'] (i.e. Trebuchet). There is no `require` to ensure that Trebuchet has deployed/updated sync-common before mw-sync invokes it.
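
One possible way to express the missing ordering is sketched below. The `command` is an assumption, and the sketch only orders resources within the catalog: whether that is sufficient depends on whether Deployment::Target['scap'] actually blocks until Trebuchet has finished deploying, which is not established here.

```puppet
# Illustrative sketch only; untested against the real manifests.
exec { 'mw-sync':
  command     => '/usr/local/bin/sync-common',  # assumed command
  refreshonly => true,
  require     => [
    File['/usr/local/bin/sync-common'],
    # Added dependency: apply the Trebuchet target before running the sync,
    # so the symlink target has a chance to exist first.
    Deployment::Target['scap'],
  ],
}
```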

It would probably be good to change the Service['apache'] subscribe to Exec['mw-sync-rebuild-cdbs'] so that Apache isn't started until after the l10n cache is present.
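
A sketch of that change, assuming Exec['mw-sync-rebuild-cdbs'] is itself already ordered after Exec['mw-sync'] (the `ensure` value is also an assumption):

```puppet
# Illustrative sketch only, not the actual ::mediawiki::web code.
service { 'apache':
  ensure    => running,
  # Subscribe to the last step of the sync so that Apache is managed only
  # after the l10n cache rebuild has been applied.
  subscribe => Exec['mw-sync-rebuild-cdbs'],
}
```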
Comment 2 Bryan Davis 2014-06-03 00:03:29 UTC
I think there is an additional point of weakness here in the design of ::deployment::target. The creation of the salt grain notifies several execs. If the call to create the salt grain succeeds on the salt master but fails to notify the host applying via puppet due to a reporting timeout, these initial execs may never be called. This breaks puppet's notion of idempotent eventual consistency.
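
To illustrate the difference, here is a hedged sketch contrasting a notification-driven exec with a state-checked one. The resource titles and the `command` path are hypothetical and do not come from ::deployment::target; only the /srv/deployment/scap/scap path appears elsewhere in this bug.

```puppet
# Hypothetical sketch; titles and command path are invented for illustration.

# Notification-driven: runs only if the refresh event arrives. If the
# event is lost once, later puppet runs will not retry.
exec { 'initial-fetch-on-notify':
  command     => '/usr/local/bin/hypothetical-initial-fetch',
  refreshonly => true,
}

# State-checked: puppet re-evaluates the guard on every run, so a missed
# first attempt is retried until the expected path exists.
exec { 'initial-fetch-until-present':
  command => '/usr/local/bin/hypothetical-initial-fetch',
  creates => '/srv/deployment/scap/scap',
}
```

The second form converges on later runs regardless of whether any notification was delivered, which is closer to the eventual-consistency behaviour described above.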
