Last modified: 2012-11-10 14:45:57 UTC
Every day, all instances hosted on WMFLabs become barely accessible from roughly 6:30 UTC for about an hour. The symptoms are:
* very high load reported in Ganglia for most instances
* ssh clients reaching their timeout
* `ls -l` hanging for a long time
This is known to be related to I/O and how GlusterFS seems to be lacking in that area. Regardless of GlusterFS, Ubuntu has a default daily cron set up at 6:25 UTC, which means that all instances start rotating or processing their logs at exactly the same time. There is probably a cron job on some of the instances that uses too much I/O. We would need some disk-usage metrics in Ganglia to find it.
Ganglia graphs from 05/21/2012 5:00 to 05/21/2012 9:00
For virt*: http://ganglia.wikimedia.org/latest/?r=custom&cs=05%2F21%2F2012+5%3A00+&ce=05%2F21%2F2012+9%3A00&m=load_one&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
For all labs instances: http://ganglia.wmflabs.org/latest/?r=hour&cs=05%2F21%2F2012+5%3A00+&ce=05%2F21%2F2012+9%3A00+&s=by+name&c=&tab=m&vn=
We have a lot of instances; if all of them start rotating logs at once, especially Nagios with hundreds of MB of logs, that could be the cause.
We need to puppetize some "randomizer" that would shift this default time a bit on each instance; it shouldn't be hard to configure cron to do that. If you look in /etc/crontab:

    SHELL=/bin/sh
    PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
    # m h dom mon dow user  command
    17 *  * * *   root    cd / && run-parts --report /etc/cron.hourly
    25 6  * * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
    47 6  * * 7   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
    52 6  1 * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )

we could as well replace the daily line with:

    25 6  * * *   root    test -x /usr/sbin/anacron || ( sleep $((RANDOM % 800 + 10)); cd / && run-parts --report /etc/cron.daily )
Please note RANDOM might not be available in /bin/sh (which is most probably dash, not bash). Another way to spread the daily jobs would be to have users set them up via puppet.
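One puppet-friendly way to spread the jobs without relying on bash's RANDOM would be a deterministic per-host offset. A minimal sketch, assuming a 0-59 minute range; the use of the hostname and of `cksum` as the hash are illustrative choices, not anything currently deployed:

```shell
# Derive a stable per-instance minute offset (0-59) from the hostname.
# cksum and cut are POSIX utilities, so this runs under dash as well as bash.
OFFSET=$(( $(hostname | cksum | cut -d' ' -f1) % 60 ))
echo "daily-cron minute offset for this host: $OFFSET"
```

Puppet could template this value into each instance's crontab, so every host runs its daily jobs at a different but stable minute.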
The labs cluster has been screwed for most of the morning, so this might not be just about cron jobs. It is still unresponsive as of now.
Anything that asks users to do something themselves will cause trouble. If RANDOM is not available we can create another workaround; there are many ways to generate random numbers, including various binaries packaged in apt.
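For example, a random delay can be produced under dash by reading from /dev/urandom instead of $RANDOM. A sketch, where the 10-809 second range just mirrors the earlier `RANDOM % 800 + 10` proposal:

```shell
# Read two random bytes (giving 0-65535) from /dev/urandom; od and tr are
# POSIX, so this works in dash, which has no $RANDOM.
DELAY=$(od -An -N2 -tu2 /dev/urandom | tr -d ' ')
# Map into a 10-809 second delay before running the daily jobs.
DELAY=$((DELAY % 800 + 10))
echo "sleeping $DELAY seconds before cron.daily"
```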
Also, I am talking about the default Ubuntu daily jobs; you can't require users to set those up using puppet. Some labs users don't even know how Unix works, and some of the daily jobs shouldn't simply be removed from the system.
See also bug 36868 : find a better home for /home/wikipedia
We just had some kind of outage for the whole cluster. The virtualization cluster showed load gradually increasing at 13:20 UTC:
http://ganglia.wikimedia.org/latest/?r=hour&cs=05%2F22%2F2012+13%3A00+&ce=05%2F22%2F2012+14%3A00+&m=load_report&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
At the same time, the dumps project on labs started showing network activity that corresponds to I/O activity over NFS:
http://ganglia.wmflabs.org/latest/graph.php?c=dumps&m=network_report&r=custom&s=by%20name&hc=4&mc=2&cs=05%2F22%2F2012%2011%3A00%20&ce=05%2F22%2F2012%2014%3A00%20&st=1337694997&g=network_report&z=medium&c=dumps
I saw the exact same behavior earlier this morning, where 30 MBytes/s were output from a datadump host in eqiad and 30 MBytes/s were input into the dumps project. At the same time, instances were unresponsive.

We need to find a workaround. Some possible solutions:
* get the `dumps` project to use an NFS share on real storage, thus bypassing GlusterFS
* rate limit network bandwidth between dataset1001 in eqiad and the labs
* find a parameter in GlusterFS that will throttle the connection
Other ideas?

Changing summary from "Labs cluster dies daily at roughly 6:30 UTC" to "dumps project overloads GlusterFS and causes cluster failure". Raising severity since this makes the cluster unusable from time to time.
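The rate-limit option could be sketched with Linux traffic control (tc). Everything here is an assumption rather than a tested config: the interface name eth0 is a placeholder, and the 30 Mbit/s ceiling is just picked to match the transfer rate observed above.

```shell
# Hypothetical HTB shaping on the labs-facing interface: put all traffic in
# a default class capped at 30 Mbit/s. Requires root; eth0 is a placeholder.
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 30mbit ceil 30mbit
```

Real shaping would likely want a filter matching only dataset1001's address rather than capping the whole interface.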
There is a gluster share which is supposed to be available across all lab instances, which has the last 5 good dumps in it. I don't know if it's been made accessible to the instances yet. It updates every day at around 4 am UTC. The point of that is so that no one has to download their own copies of the dumps to work on them in a labs project (wasting space and bandwidth).
Following a discussion with Hydriz, here is what he does:
* rsync dumps to his instance in /data/project/dumps (which hits GlusterFS)
* upload the dumps to the Internet Archive using curl and their S3 interface
So we are copying the data into GlusterFS just to move it out again afterwards. I guess the comment by Ariel above could be a good solution.
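In principle the upload step could stream straight from the source copy to the Internet Archive without staging anything on GlusterFS. A sketch, where the item name, file name, and credentials are all placeholders (the `authorization: LOW key:secret` header is the Internet Archive's S3-like convention):

```shell
# Placeholder item and file names; real credentials come from archive.org.
ITEM="example-dumps-item"
FILE="enwiki-20120501-pages-articles.xml.bz2"
URL="https://s3.us.archive.org/$ITEM/$FILE"
echo "$URL"
# The actual transfer (not run here) would look like:
# curl --location --header "authorization: LOW ACCESSKEY:SECRET" \
#      --upload-file "$FILE" "$URL"
```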
Hydriz is going to upload to S3 from the copy Ariel is referring to in comment 10.
Since we have found a workaround for the recent problems, I am closing this bug. The root cause is that GlusterFS can be killed by a single instance doing heavy I/O; that should be tracked in a separate bug.
*** Bug 36997 has been marked as a duplicate of this bug. ***