Last modified: 2014-06-12 17:20:53 UTC
It seems the /home partition on stat1002 filled up between 2014-04-03 and 2014-04-04, but no one was notified by Icinga. I only noticed when going through cron mail and seeing that on 2014-04-04 04:30, one of my jobs failed with "No space left on device" for /home/qchris on stat1002. I freed up a few GB for now, but $SOME_SERVICE (Icinga?) should warn in time about disks getting full. Let's get $SOME_SERVICE to alert about disks getting full.
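To illustrate the kind of check being asked for, here is a minimal sketch of a disk-usage probe. It is not the actual Icinga configuration for stat1002; the function names and the 80%/90% thresholds are assumptions chosen for illustration. The return codes follow the usual Nagios/Icinga plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
import shutil

# Return codes following the Nagios/Icinga plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_usage(used_fraction, warn=0.80, crit=0.90):
    """Map a used-space fraction (0.0 to 1.0) to a (code, label) pair.

    The default thresholds are illustrative assumptions, not values
    taken from any real monitoring setup.
    """
    if used_fraction >= crit:
        return CRITICAL, "CRITICAL"
    if used_fraction >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/home", warn=0.80, crit=0.90):
    """Check the filesystem containing `path`; return (code, message)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total
    code, label = classify_usage(used_fraction, warn, crit)
    message = "DISK {} - {} is {:.1%} full".format(label, path, used_fraction)
    return code, message
```

A wrapper script could print the message from check_disk("/home") and exit with the returned code, so that the monitoring service raises a warning before jobs start failing with "No space left on device".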
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1527
I think this was also me; not quite sure how. I was messing around with the sampled logs in my home directory, but I'm not sure how that'd correspond. I'm going to investigate now I'm conscious. Evidently this is the week of Oliver Accidentally Revealing Oversight Issues With Our Cluster :D
(In reply to Oliver Keyes from comment #2)
> I think this was also me

Hahaha. Sorry to disappoint you again, but a "du" on /home showed that it was not you :-D

But this bug is not about "Who filled up the disk?". Disks will always get full. Analyses start small, and grow ... and grow ... and grow. And then the disk is full. Meh. Rather, this bug is about "Why did no service warn about disks getting full?".
Let's see if we can prioritize this in the next sprint.
Another issue where we need ops support.