Last modified: 2014-10-22 16:47:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T72752, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 70752 - [l10n] l10nupdate process should respect the scap lock file
[l10n] l10nupdate process should respect the scap lock file
Status: ASSIGNED
Product: Wikimedia
Classification: Unclassified
Deployment systems (Other open bugs)
wmf-deployment
All All
: High normal (vote)
: ---
Assigned To: Sam Reed (reedy)
:
Depends on: 70443
Blocks:
  Show dependency treegraph
 
Reported: 2014-09-12 03:17 UTC by Matthew Flaschen
Modified: 2014-10-22 16:47 UTC (History)
6 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Full scap log (78.51 KB, text/plain)
2014-09-12 03:17 UTC, Matthew Flaschen
Details

Description Matthew Flaschen 2014-09-12 03:17:38 UTC
Created attachment 16449 [details]
Full scap log

This scap had a lot of errors (and does not seem to have had one of its desired effects, fixing the messages at https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=true&lang=en&modules=ext.gettingstarted.lightbulb.flyout), but I think "file has vanished" may be one of the root ones.

Partial log (full log attached):

---
02:53:30 ['sync-common', '--no-update-l10n', 'mw1010.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1064 returned [70]: 02:52:25 Copying to mw1064.eqiad.wmnet from mw1010.eqiad.wmnet
02:52:25 Started rsync common
file has vanished: "/php-1.24wmf21/cache/l10n/.l10n_cache-ak.cdb.FyWmNi" (in common)
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1536) [generator=3.0.9]
02:53:29 Finished rsync common (duration: 01m 04s)
02:53:29 Unhandled error:
Traceback (most recent call last):                                                                                                                   
  File "/srv/deployment/scap/scap/scap/cli.py", line 262, in run                                                                                     
    exit_status = app.main(extra_args)                                                                                                               
  File "/srv/deployment/scap/scap/scap/main.py", line 282, in main                                                                                   
    verbose=self.verbose                                                                                                                             
  File "/srv/deployment/scap/scap/scap/tasks.py", line 275, in sync_common                                                                           
    subprocess.check_call(rsync)                                                                                                                     
  File "/usr/lib/python2.7/subprocess.py", line 511, in check_call                                                                                   
    raise CalledProcessError(retcode, cmd)                                                                                                           
CalledProcessError: Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/.svn/lock', '--exclude=**/.git/objects', '--exclude=**/.git/**/objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'mw1010.eqiad.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 24                                                                      
02:53:29 sync-common failed: <CalledProcessError> Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/.svn/lock', '--exclude=**/.git/objects', '--exclude=**/.git/**/objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'mw1010.eqiad.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 24                                        
---

This this repeats for various languages.
Comment 1 Matthew Flaschen 2014-09-12 03:48:52 UTC
(In reply to Matthew Flaschen from comment #0)
> This scap had a lot of errors (and does not seem to have had one of its
> desired effects, fixing the messages at
> https://bits.wikimedia.org/en.wikipedia.org/load.
> php?debug=true&lang=en&modules=ext.gettingstarted.lightbulb.flyout), but I
> think "file has vanished" may be one of the root ones.

I still don't really know what any of the underlying problems were, or what fixed what, but if you follow that link you can see the RL message blob is now correct (confirmed on GUI).
Comment 2 Bryan Davis 2014-09-12 04:26:24 UTC
Was this the host where Roan killed the ssh session from tin or another host? In general this looks like rsync experiencing some sort of major on the receiving client side. With the horrible sync times and extreme load that the mw1010 server was under during this scap I'm actually not surprised to see a few errors like this occur.

I'm not sure there is much that could be done to adjust for this in the scap program itself. We could dial back the number of parallel processes we run or try to add some sort of load check in the rsync server selection process. The settings we are running now however have remained the same for many months and we have only seen these really bad sync times and high load on the rsync servers in the last week or two. Many (all?) of the rsync servers are job runner nodes. I'm sort of wondering if something has changed in the behavior or job load that is clashing with scap.

I'd really like to move from rsync to a git fetch or other method of synchronization that uses precomputed diffs and lessens the computational cost on both the client and server.
Comment 3 Greg Grossmeier 2014-09-12 15:56:31 UTC
Unsure of relatedness or same root-cause, but, at the same time (I think based on timestamps in the log) tin went crazy network-wise:

https://ganglia.wikimedia.org/latest/?r=custom&cs=09%2F11%2F2014+22%3A00+&ce=09%2F12%2F2014+05%3A00+&c=Miscellaneous+eqiad&h=tin.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=ALLGROUPS

Compare to a normal Thursday deploy:
https://ganglia.wikimedia.org/latest/?r=custom&cs=09%2F04%2F2014+22%3A00+&ce=09%2F05%2F2014+05%3A00+&c=Miscellaneous+eqiad&h=tin.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=ALLGROUPS

Twice the CPU and Memory as normal.

Looks like an l10nupdate and scap at the same time is too much for tin.
Comment 4 Bryan Davis 2014-09-12 16:01:34 UTC
(In reply to Greg Grossmeier from comment #3)
> 
> Looks like an l10nupdate and scap at the same time is too much for tin.

Oh man. They would be fighting over the content of the l10n cdb files and l10nupdate doesn't do the nicest things possible to sync the blobs with the cluster. The l10nupdate process should probably respect the scap lock file. As Sam pointed out last week l10nupdate should be changed to actually use scap/sync-common to update the cluster and use the json representation to transmit deltas to the mw servers rather than direct cdb syncs.
Comment 5 Greg Grossmeier 2014-09-12 16:04:43 UTC
(In reply to Bryan Davis from comment #4)
> As Sam pointed out last week l10nupdate should be changed to actually use
> scap/sync-common to update the cluster and use the json representation to
> transmit deltas to the mw servers rather than direct cdb syncs.

https://bugzilla.wikimedia.org/show_bug.cgi?id=70446
Comment 6 Greg Grossmeier 2014-09-12 17:53:02 UTC
Reedy: Can you take a look at this in relation to the "Are Scap and LocalisationUpdate actually fighting over l10nupdates?" issue?
Comment 7 Matthew Flaschen 2014-09-12 21:56:26 UTC
(In reply to Bryan Davis from comment #2)
> Was this the host where Roan killed the ssh session from tin or another
> host?

This was actually a separate scap.  I did two scaps:

[Thursday, September 11, 2014] [7:24:53 PM] <RoanKattouw> bd808: OK so should I just kill the ssh mw1063 process on tin so the scap will move on with 1 failure?
[Thursday, September 11, 2014] [7:24:54 PM] <bd808> AaronS: So this may be an issue in general? Or we need to not use job runners as rsync slaves anymore?
[Thursday, September 11, 2014] [7:25:01 PM] Nick James_F|Away is now known as James_F.
[Thursday, September 11, 2014] [7:25:01 PM] <bd808> RoanKattouw: Sure
[Thursday, September 11, 2014] [7:25:04 PM] <RoanKattouw> OK
[Thursday, September 11, 2014] [7:25:48 PM] <RoanKattouw> Done

...

[Thursday, September 11, 2014] [7:26:50 PM] <superm401> 23:25:11 Finished sync-apaches (duration: 87m 15s)

...

[Thursday, September 11, 2014] [7:29:49 PM] <logmsgbot> !log mattflaschen Finished scap: Deploy new GettingStarted recommendations A/B test (duration: 99m 34s)

...

[Thursday, September 11, 2014] [11:08:42 PM] <logmsgbot> !log mattflaschen Finished scap: One last CSS fix (wrapping issue for error state) for GettingStarted A/B test (duration: 24m 38s)

This bug in regards to the second one (though the first one had its own issues).

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links