Last modified: 2014-02-22 19:29:04 UTC
I noticed this on a few hosts yesterday that got fixed, but I seem to have found some more.

Execution nodes:
- tools-exec-02
- tools-exec-03
- tools-exec-04
- tools-exec-05
- tools-exec-06
- tools-exec-09

It might also be worth someone checking the webservers. Also, is there any way we can get Ganglia to check this?
I take the blame for fixing some last night, but because some hosts were not conveniently accessible, I gave up at some point :-) (pdsh is awesome if all hosts accept your credentials).

autofs sucks big time. Just now, on tools-exec-02, I ran "service autofs reload" and later "service autofs stop && service autofs start", but it neither mounts /public/datasets nor gives any log information about why it failed. So I've mounted manually:

| sudo mkdir /public/datasets && sudo mount -t nfs -o nfsvers=3,ro labstore1.pmtpa.wmnet:/publicdata-project /public/datasets

on all the above hosts.

We should use Icinga for monitoring & alerts; Ganglia is more for performance data. I'll add some checks to my personal poor-mans-icinga script for now.

Thankfully, in eqiad we will get rid of autofs and use Puppet mounts instead. My € 0,02: We shouldn't wait that long, but should use Puppet mounts in pmtpa as well.
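For illustration, a check of the kind mentioned above could look roughly like the sketch below. It is not the actual poor-mans-icinga script: the function name check_nfs_mount and its arguments are made up for this example. It assumes the mount appears in /proc/mounts with type nfs or nfs4, and it uses the usual Nagios/Icinga plugin exit codes (0 = OK, 2 = CRITICAL).

```shell
#!/bin/sh
# Hypothetical helper (not from this thread): report whether a given
# mount point shows up as an NFS mount in a /proc/mounts-style file.
# Exit codes follow the Nagios/Icinga plugin convention:
#   0 = OK, 2 = CRITICAL
check_nfs_mount() {
    mountpoint_path=$1
    # The second argument is optional and exists only to allow testing
    # against a sample file instead of the live /proc/mounts.
    mounts_file=${2:-/proc/mounts}

    # /proc/mounts fields: device, mount point, fs type, options, dump, pass
    while read -r device target fstype _rest; do
        if [ "$target" = "$mountpoint_path" ]; then
            case $fstype in
                nfs|nfs4)
                    echo "OK: $mountpoint_path mounted from $device ($fstype)"
                    return 0
                    ;;
            esac
        fi
    done < "$mounts_file"

    echo "CRITICAL: $mountpoint_path is not an NFS mount"
    return 2
}

# Example use on a host:
#   check_nfs_mount /public/datasets
```

Matching on the file system type rather than only on the path means that a bare directory left behind by mkdir, or an autofs placeholder, would still be reported as CRITICAL.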
I meant Icinga :) Thanks for your work Tim! *is looking forward to eqiad*