Bug 36993 - dumps project overloads GlusterFS and causes cluster failure
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: General (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Duplicates: 36997 (view as bug list)
Depends on:
Blocks: 41967

Reported: 2012-05-21 08:10 UTC by Antoine "hashar" Musso (WMF)
Modified: 2012-11-10 14:45 UTC (History)
CC: 6 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Antoine "hashar" Musso (WMF) 2012-05-21 08:10:24 UTC
Every day, all instances hosted on WMF Labs become barely accessible from roughly 6:30 am UTC for about an hour. The symptoms are:
* very high load reported in Ganglia for most instances
* SSH clients reaching their timeout
* `ls -l` being
This is known to be related to I/O, an area where GlusterFS seems to be lacking.

Regardless of GlusterFS, Ubuntu has a default daily cron job set up at 6:25 UTC, which means that all instances start rotating or processing their logs at exactly the same time.


There must be a cron job on some of the instances that uses too much I/O. We would need disk-usage metrics in Ganglia to find out which one.
Comment 2 Peter Bena 2012-05-21 08:33:02 UTC
We have a lot of instances; if all of them start rotating logs at the same time, especially Nagios with hundreds of MB of logs, that could be the cause.
Comment 3 Peter Bena 2012-05-21 08:39:46 UTC
We need to puppetize some "randomizer" which would shift this default time a bit on each instance; it shouldn't be hard to configure cron to do that. If you look in /etc/crontab:

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# m h dom mon dow user  command
17 *    * * *   root    cd / && run-parts --report /etc/cron.hourly
25 6    * * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
47 6    * * 7   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
52 6    1 * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )

we could replace the daily entry with:

25 6    * * *   root    test -x /usr/sbin/anacron || ( sleep $((RANDOM\%800+10)); cd / && run-parts --report /etc/cron.daily )
Comment 4 Antoine "hashar" Musso (WMF) 2012-05-21 09:20:38 UTC
Please note RANDOM might not be available in /bin/sh (which is most probably dash and not bash).

Another way to spread the daily jobs would be to have the users set them up via Puppet.
Comment 5 Antoine "hashar" Musso (WMF) 2012-05-21 12:37:57 UTC
The labs cluster has been screwed up for most of the morning, so this might not just be about cron jobs. It is still unresponsive as of now.
Comment 6 Peter Bena 2012-05-21 13:16:16 UTC
Anything that asks users to do something themselves will cause trouble. If RANDOM is not available we can create another workaround; there are many ways to generate random numbers, including various binaries packaged in apt (see the sketch below for one option).
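
For illustration only (this sketch is not part of the original discussion), a dash-compatible variant could take its random delay from /dev/urandom via od from coreutils instead of bash's RANDOM; note that % has to be escaped as \% inside a crontab entry:

25 6    * * *   root    test -x /usr/sbin/anacron || ( sleep $(( $(od -An -N2 -tu2 /dev/urandom) \% 800 + 10 )); cd / && run-parts --report /etc/cron.daily )

od reads two bytes of randomness and prints them as an unsigned integer (0-65535), so the delay ends up between 10 and 809 seconds, spreading the daily jobs without requiring bash.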
Comment 7 Peter Bena 2012-05-21 13:18:17 UTC
Also, I am talking about the default Ubuntu daily jobs; you can't require users to set those up using Puppet. Some labs users don't even know how Unix works, and some of the daily jobs shouldn't just be removed from the system.
Comment 8 Antoine "hashar" Musso (WMF) 2012-05-21 20:27:59 UTC
See also bug 36868 : find a better home for /home/wikipedia
Comment 9 Antoine "hashar" Musso (WMF) 2012-05-22 14:05:34 UTC
We just had some kind of outage for the whole cluster. The virtualization cluster showed load gradually increasing from 13:20 UTC:

http://ganglia.wikimedia.org/latest/?r=hour&cs=05%2F22%2F2012+13%3A00+&ce=05%2F22%2F2012+14%3A00+&m=load_report&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4

At the same time, the dumps project on labs started showing network activity which corresponds to I/O activity over NFS:
http://ganglia.wmflabs.org/latest/graph.php?c=dumps&m=network_report&r=custom&s=by%20name&hc=4&mc=2&cs=05%2F22%2F2012%2011%3A00%20&ce=05%2F22%2F2012%2014%3A00%20&st=1337694997&g=network_report&z=medium&c=dumps

I have seen the exact same behavior earlier this morning, where 30 MB/s were being output from a datadump host in eqiad and 30 MB/s were being input in the dumps project. At the same time, instances were unresponsive.


We need to find a workaround; some possible solutions:
- get the `dump` project to use an NFS share on real storage, thus bypassing GlusterFS
- rate limit network bandwidth between dataset1001 in eqiad and the labs (see the sketch below)
- find a parameter in GlusterFS that will throttle the connection

Other ideas?
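
For illustration only (this sketch is not part of the original report), rate limiting could be done on the dataset host with tc, shaping traffic towards the labs network; the interface name and destination subnet are placeholders:

# limit traffic to the (hypothetical) labs subnet to 10 Mbit/s on eth0
tc qdisc add dev eth0 root handle 1: htb
tc class add dev eth0 parent 1: classid 1:10 htb rate 10mbit ceil 10mbit
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dst 10.4.0.0/16 flowid 1:10

Unclassified traffic is not shaped, so only the transfers towards labs would be throttled.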


Changing summary from: "Labs cluster dies daily at roughly 6:30 UTC"
To: "dumps project overloads GlusterFS and causes cluster failure"

Raising severity since that makes the cluster unusable from time to time.
Comment 10 Ariel T. Glenn 2012-05-22 14:12:38 UTC
There is a Gluster share, supposed to be available across all labs instances, which has the last 5 good dumps in it. I don't know if it's been made accessible to the instances yet. It updates every day at around 4 am UTC.

The point of that share is that no one has to download their own copy of the dumps to work on them in a labs project (which wastes space and bandwidth).
Comment 11 Antoine "hashar" Musso (WMF) 2012-05-22 14:16:08 UTC
Following a discussion with Hydriz, here is what he does:

- rsync dumps to his instance in /data/project/dumps (which hits GlusterFS)
- upload the dumps to Internet Archive using curl and their S3 interface

So we are copying the data into GlusterFS just to move it out afterwards. I guess the share Ariel describes above could be a good solution.
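
For illustration only (this sketch is not part of the original comment), an upload through the Internet Archive's S3-like API is typically a single curl PUT; the identifier, filename and keys below are placeholders:

curl --location \
     --header "authorization: LOW <accesskey>:<secret>" \
     --header "x-amz-auto-make-bucket:1" \
     --upload-file <dumpfile> \
     http://s3.us.archive.org/<identifier>/<dumpfile>

Reading the dump from the shared copy instead of /data/project/dumps would avoid the extra write through GlusterFS.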
Comment 12 Antoine "hashar" Musso (WMF) 2012-05-22 14:41:11 UTC
Hydriz is going to upload to S3 from the copy Ariel is referring to in comment 10.
Comment 13 Antoine "hashar" Musso (WMF) 2012-05-22 14:43:09 UTC
Since we have found a workaround for the recent problems we had, I am closing this bug.

The root cause is that GlusterFS can be brought down by a single instance doing heavy I/O. That should be tracked in a separate bug.
Comment 14 Nemo 2012-11-10 14:44:21 UTC
*** Bug 36997 has been marked as a duplicate of this bug. ***
