Last modified: 2014-09-23 23:58:03 UTC
See: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140313-Deploy
The puppet configuration that attempts to ensure that each apache server has the latest version of the mediawiki code and configuration is in ::mediawiki::sync. Specifically, Exec['mw-sync'] and Exec['mw-sync-rebuild-cdbs'] combine to perform the end-host scap steps of syncing with the state of the rsync server on tin. The Exec['mw-sync'] definition is marked as `refreshonly => true`, which means it will only be applied if another resource explicitly notifies it. That notification comes from ::mediawiki::web, where Exec['apache-trigger-mw-sync'] is defined. This exec checks whether any apache2 processes are running. If none are found, it notifies Exec['mw-sync']. The Service['apache'] resource subscribes to Exec['mw-sync'] so that apache is started after Exec['mw-sync'] has completed.
There is at least one possible race condition in ::mediawiki::sync. Exec['mw-sync'] requires File['/usr/local/bin/sync-common'], but File['/usr/local/bin/sync-common'] is a symlink to /srv/deployment/scap/scap/bin/sync-common, and that file is realized by Deployment::Target['scap'] (i.e. Trebuchet). There is no require to ensure that Trebuchet has deployed or updated sync-common before mw-sync invokes it, so the symlink can exist while its target is still missing or stale.
It would probably also be good to change the Service['apache'] subscribe to Exec['mw-sync-rebuild-cdbs'] so that Apache isn't started until after the l10n cache is present.
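The relationships described above, together with the two suggested changes, might look roughly like the following. This is a minimal sketch, not the actual manifests: the commands, the `pgrep` check, the rebuild script path, and the notify from Exec['mw-sync'] to Exec['mw-sync-rebuild-cdbs'] are assumptions made for illustration, and File['/usr/local/bin/sync-common'] and Deployment::Target['scap'] are assumed to be declared elsewhere in the catalog.

```puppet
# Sketch only: commands and check logic are placeholders, not the real manifests.

exec { 'apache-trigger-mw-sync':
  # Assumed check: pgrep exits non-zero when no apache2 process exists,
  # so this exec only runs (and notifies) when apache is not running.
  command => '/bin/true',
  unless  => '/usr/bin/pgrep apache2',
  notify  => Exec['mw-sync'],
}

exec { 'mw-sync':
  command     => '/usr/local/bin/sync-common',
  # Only runs when another resource notifies it.
  refreshonly => true,
  require     => [
    File['/usr/local/bin/sync-common'],
    # Proposed addition: make sure Trebuchet has deployed scap before
    # the symlink target is invoked.
    Deployment::Target['scap'],
  ],
  # Assumption: the cdb rebuild is chained off the sync step.
  notify      => Exec['mw-sync-rebuild-cdbs'],
}

exec { 'mw-sync-rebuild-cdbs':
  # Hypothetical path; the real rebuild command is not given in this page.
  command     => '/usr/local/bin/placeholder-rebuild-cdbs',
  refreshonly => true,
}

service { 'apache':
  ensure    => running,
  # Proposed change: subscribe to the last sync step instead of
  # Exec['mw-sync'], so apache starts only after the l10n cache exists.
  subscribe => Exec['mw-sync-rebuild-cdbs'],
}
```

Because `subscribe` implies ordering as well as refresh, pointing it at the final exec would be enough to delay the apache start until both sync steps have finished.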
I think there is an additional point of weakness here in the design of ::deployment::target. The creation of the salt grain notifies several execs. If the call to create the salt grain succeeds on the salt master but the result is not reported back to the host applying the puppet catalog (for example, due to a reporting timeout), puppet sees a failure and never sends the notification. On the next run the grain already exists, so nothing changes and these initial execs may never be called. This breaks puppet's notion of idempotent eventual consistency.
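One way to avoid depending on a one-time notification would be to guard each initial exec with a check of the actual state on disk, so that every puppet run re-evaluates it. The following is only a sketch of that idea: the resource title and command are hypothetical placeholders, and using the deployed sync-common file as the marker is an assumption, not how ::deployment::target is written.

```puppet
# Sketch only: title and command are hypothetical placeholders.
exec { 'deployment-target-initial-fetch-scap':
  command => '/usr/local/bin/placeholder-initial-fetch scap/scap',
  # State-based guard instead of `refreshonly => true`: the exec is
  # considered on every run and skipped once the deployed file exists,
  # so a lost notification only delays it rather than preventing it.
  creates => '/srv/deployment/scap/scap/bin/sync-common',
}
```

The trade-off is that the guard has to describe the desired end state accurately; a marker that can exist after a partial deploy would hide the same kind of failure.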