Last modified: 2014-10-22 16:47:46 UTC
Created attachment 16449 [details] Full scap log This scap had a lot of errors (and does not seem to have had one of its desired effects, fixing the messages at https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=true&lang=en&modules=ext.gettingstarted.lightbulb.flyout), but I think "file has vanished" may be one of the root ones. Partial log (full log attached): --- 02:53:30 ['sync-common', '--no-update-l10n', 'mw1010.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1064 returned [70]: 02:52:25 Copying to mw1064.eqiad.wmnet from mw1010.eqiad.wmnet 02:52:25 Started rsync common file has vanished: "/php-1.24wmf21/cache/l10n/.l10n_cache-ak.cdb.FyWmNi" (in common) rsync warning: some files vanished before they could be transferred (code 24) at main.c(1536) [generator=3.0.9] 02:53:29 Finished rsync common (duration: 01m 04s) 02:53:29 Unhandled error: Traceback (most recent call last): File "/srv/deployment/scap/scap/scap/cli.py", line 262, in run exit_status = app.main(extra_args) File "/srv/deployment/scap/scap/scap/main.py", line 282, in main verbose=self.verbose File "/srv/deployment/scap/scap/scap/tasks.py", line 275, in sync_common subprocess.check_call(rsync) File "/usr/lib/python2.7/subprocess.py", line 511, in check_call raise CalledProcessError(retcode, cmd) CalledProcessError: Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/.svn/lock', '--exclude=**/.git/objects', '--exclude=**/.git/**/objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'mw1010.eqiad.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 24 02:53:29 sync-common failed: <CalledProcessError> Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/.svn/lock', '--exclude=**/.git/objects', '--exclude=**/.git/**/objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'mw1010.eqiad.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 24 --- This this repeats for various languages.
(In reply to Matthew Flaschen from comment #0) > This scap had a lot of errors (and does not seem to have had one of its > desired effects, fixing the messages at > https://bits.wikimedia.org/en.wikipedia.org/load. > php?debug=true&lang=en&modules=ext.gettingstarted.lightbulb.flyout), but I > think "file has vanished" may be one of the root ones. I still don't really know what any of the underlying problems were, or what fixed what, but if you follow that link you can see the RL message blob is now correct (confirmed on GUI).
Was this the host where Roan killed the ssh session from tin or another host? In general this looks like rsync experiencing some sort of major on the receiving client side. With the horrible sync times and extreme load that the mw1010 server was under during this scap I'm actually not surprised to see a few errors like this occur. I'm not sure there is much that could be done to adjust for this in the scap program itself. We could dial back the number of parallel processes we run or try to add some sort of load check in the rsync server selection process. The settings we are running now however have remained the same for many months and we have only seen these really bad sync times and high load on the rsync servers in the last week or two. Many (all?) of the rsync servers are job runner nodes. I'm sort of wondering if something has changed in the behavior or job load that is clashing with scap. I'd really like to move from rsync to a git fetch or other method of synchronization that uses precomputed diffs and lessens the computational cost on both the client and server.
Unsure of relatedness or same root-cause, but, at the same time (I think based on timestamps in the log) tin went crazy network-wise: https://ganglia.wikimedia.org/latest/?r=custom&cs=09%2F11%2F2014+22%3A00+&ce=09%2F12%2F2014+05%3A00+&c=Miscellaneous+eqiad&h=tin.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=ALLGROUPS Compare to a normal Thursday deploy: https://ganglia.wikimedia.org/latest/?r=custom&cs=09%2F04%2F2014+22%3A00+&ce=09%2F05%2F2014+05%3A00+&c=Miscellaneous+eqiad&h=tin.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=ALLGROUPS Twice the CPU and Memory as normal. Looks like an l10nupdate and scap at the same time is too much for tin.
(In reply to Greg Grossmeier from comment #3) > > Looks like an l10nupdate and scap at the same time is too much for tin. Oh man. They would be fighting over the content of the l10n cdb files and l10nupdate doesn't do the nicest things possible to sync the blobs with the cluster. The l10nupdate process should probably respect the scap lock file. As Sam pointed out last week l10nupdate should be changed to actually use scap/sync-common to update the cluster and use the json representation to transmit deltas to the mw servers rather than direct cdb syncs.
(In reply to Bryan Davis from comment #4) > As Sam pointed out last week l10nupdate should be changed to actually use > scap/sync-common to update the cluster and use the json representation to > transmit deltas to the mw servers rather than direct cdb syncs. https://bugzilla.wikimedia.org/show_bug.cgi?id=70446
Reedy: Can you take a look at this in relation to the "Are Scap and LocalisationUpdate actually fighting over l10nupdates?" issue?
(In reply to Bryan Davis from comment #2) > Was this the host where Roan killed the ssh session from tin or another > host? This was actually a separate scap. I did two scaps: [Thursday, September 11, 2014] [7:24:53 PM] <RoanKattouw> bd808: OK so should I just kill the ssh mw1063 process on tin so the scap will move on with 1 failure? [Thursday, September 11, 2014] [7:24:54 PM] <bd808> AaronS: So this may be an issue in general? Or we need to not use job runners as rsync slaves anymore? [Thursday, September 11, 2014] [7:25:01 PM] Nick James_F|Away is now known as James_F. [Thursday, September 11, 2014] [7:25:01 PM] <bd808> RoanKattouw: Sure [Thursday, September 11, 2014] [7:25:04 PM] <RoanKattouw> OK [Thursday, September 11, 2014] [7:25:48 PM] <RoanKattouw> Done ... [Thursday, September 11, 2014] [7:26:50 PM] <superm401> 23:25:11 Finished sync-apaches (duration: 87m 15s) ... [Thursday, September 11, 2014] [7:29:49 PM] <logmsgbot> !log mattflaschen Finished scap: Deploy new GettingStarted recommendations A/B test (duration: 99m 34s) ... [Thursday, September 11, 2014] [11:08:42 PM] <logmsgbot> !log mattflaschen Finished scap: One last CSS fix (wrapping issue for error state) for GettingStarted A/B test (duration: 24m 38s) This bug in regards to the second one (though the first one had its own issues).