Last modified: 2012-11-10 14:45:57 UTC
Every day, all instances hosted on WMFLabs become barely accessible from roughly 6:30 UTC for about an hour. The symptoms are:
* very high load reported in Ganglia for most instances
* ssh clients reaching their timeout
* `ls -l` hanging for a long time
This is known to be related to I/O and how GlusterFS seems to be lacking in that area. Regardless of GlusterFS, Ubuntu has a default daily cron set up at 6:25 UTC, which means that all instances start rotating or processing their logs at exactly the same time. There is probably a cron job on some of the instances that uses too much I/O. We would need some disk-usage metrics in Ganglia to find it.
Ganglia graphs from 05/21/2012 5:00 to 05/21/2012 9:00
For virt*: http://ganglia.wikimedia.org/latest/?r=custom&cs=05%2F21%2F2012+5%3A00+&ce=05%2F21%2F2012+9%3A00&m=load_one&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
For all labs instances: http://ganglia.wmflabs.org/latest/?r=hour&cs=05%2F21%2F2012+5%3A00+&ce=05%2F21%2F2012+9%3A00+&s=by+name&c=&tab=m&vn=
We have a lot of instances; if all of them start rotating logs at once, especially Nagios with hundreds of MB of logs, that could be the cause.
We need to puppetize some "randomizer" that would shift this default time a bit on each instance; it shouldn't be hard to configure cron to do that. If you look in /etc/crontab:

    SHELL=/bin/sh
    PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
    # m h dom mon dow user  command
    17 *  * * *   root    cd / && run-parts --report /etc/cron.hourly
    25 6  * * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
    47 6  * * 7   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
    52 6  1 * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )

we could as well replace the daily line with:

    25 6  * * *   root    test -x /usr/sbin/anacron || ( sleep $((RANDOM % 800 + 10)); cd / && run-parts --report /etc/cron.daily )
Please note RANDOM might not be available in /bin/sh (which is most probably dash, not bash). Another way to spread the daily jobs would be to have users set them up via puppet.
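One puppet-friendly way to spread the jobs without relying on bash's RANDOM would be a deterministic per-host offset. A minimal sketch, assuming a 0-59 minute range; the use of the hostname and of `cksum` as the hash are illustrative choices, not anything currently deployed:

```shell
# Derive a stable per-instance minute offset (0-59) from the hostname.
# cksum and cut are POSIX utilities, so this runs under dash as well as bash.
OFFSET=$(( $(hostname | cksum | cut -d' ' -f1) % 60 ))
echo "daily-cron minute offset for this host: $OFFSET"
```

Puppet could template this value into each instance's crontab, so every host runs its daily jobs at a different but stable minute.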
The labs cluster has been screwed for most of the morning, so this might not be just about cron jobs. It is still unresponsive as of now.
Anything that asks users to do something themselves will cause trouble. If RANDOM is not available we can create another workaround; there are many ways to generate random numbers, including various binaries packaged in apt.
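For example, a random delay can be produced under dash by reading from /dev/urandom instead of $RANDOM. A sketch, where the 10-809 second range just mirrors the earlier `RANDOM % 800 + 10` proposal:

```shell
# Read two random bytes (giving 0-65535) from /dev/urandom; od and tr are
# POSIX, so this works in dash, which has no $RANDOM.
DELAY=$(od -An -N2 -tu2 /dev/urandom | tr -d ' ')
# Map into a 10-809 second delay before running the daily jobs.
DELAY=$((DELAY % 800 + 10))
echo "sleeping $DELAY seconds before cron.daily"
```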
Also, I am talking about the default Ubuntu daily jobs; you can't require users to set those up using puppet. Some labs users don't even know how Unix works, and some of the daily jobs shouldn't simply be removed from the system.
See also bug 36868 : find a better home for /home/wikipedia
We just had some kind of outage for the whole cluster. The virtualization cluster showed load gradually increasing at 13:20 UTC:
http://ganglia.wikimedia.org/latest/?r=hour&cs=05%2F22%2F2012+13%3A00+&ce=05%2F22%2F2012+14%3A00+&m=load_report&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
At the same time, the dumps project on labs started showing network activity that corresponds to I/O activity over NFS:
http://ganglia.wmflabs.org/latest/graph.php?c=dumps&m=network_report&r=custom&s=by%20name&hc=4&mc=2&cs=05%2F22%2F2012%2011%3A00%20&ce=05%2F22%2F2012%2014%3A00%20&st=1337694997&g=network_report&z=medium&c=dumps
I saw the exact same behavior earlier this morning, where 30 MBytes/s were output from a datadump host in eqiad and 30 MBytes/s were input into the dumps project. At the same time, instances were unresponsive.

We need to find a workaround. Some possible solutions:
* get the `dumps` project to use an NFS share on real storage, thus bypassing GlusterFS
* rate limit network bandwidth between dataset1001 in eqiad and the labs
* find a parameter in GlusterFS that will throttle the connection
Other ideas?

Changing summary from "Labs cluster dies daily at roughly 6:30 UTC" to "dumps project overloads GlusterFS and causes cluster failure". Raising severity since this makes the cluster unusable from time to time.
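The rate-limit option could be sketched with Linux traffic control (tc). Everything here is an assumption rather than a tested config: the interface name eth0 is a placeholder, and the 30 Mbit/s ceiling is just picked to match the transfer rate observed above.

```shell
# Hypothetical HTB shaping on the labs-facing interface: put all traffic in
# a default class capped at 30 Mbit/s. Requires root; eth0 is a placeholder.
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 30mbit ceil 30mbit
```

Real shaping would likely want a filter matching only dataset1001's address rather than capping the whole interface.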
There is a gluster share which is supposed to be available across all lab instances, which has the last 5 good dumps in it. I don't know if it's been made accessible to the instances yet. It updates every day at around 4 am UTC. The point of that is so that no one has to download their own copies of the dumps to work on them in a labs project (wasting space and bandwidth).
Following a discussion with Hydriz, here is what he does:
* rsync dumps to his instance in /data/project/dumps (which hits GlusterFS)
* upload the dumps to the Internet Archive using curl and their S3 interface
So we are copying the data into GlusterFS just to move it out again afterwards. I guess the comment by Ariel above could be a good solution.
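In principle the upload step could stream straight from the source copy to the Internet Archive without staging anything on GlusterFS. A sketch, where the item name, file name, and credentials are all placeholders (the `authorization: LOW key:secret` header is the Internet Archive's S3-like convention):

```shell
# Placeholder item and file names; real credentials come from archive.org.
ITEM="example-dumps-item"
FILE="enwiki-20120501-pages-articles.xml.bz2"
URL="https://s3.us.archive.org/$ITEM/$FILE"
echo "$URL"
# The actual transfer (not run here) would look like:
# curl --location --header "authorization: LOW ACCESSKEY:SECRET" \
#      --upload-file "$FILE" "$URL"
```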
Hydriz is going to upload to S3 from the copy Ariel is referring to in comment 10.
Since we have found a workaround for the recent problems, I am closing this bug. The root cause is that GlusterFS can be killed by a single instance doing heavy I/O; that should be tracked in a separate bug.
*** Bug 36997 has been marked as a duplicate of this bug. ***