Last modified: 2014-09-23 22:56:19 UTC
Besides the Ganglia statistics, the grid's status should be properly monitored and alarms set up. Off the top of my head and without data to back it up:
- master alive and well (no threads in error state!),
- every execution daemon alive and well,
- count of jobs in error state doesn't exceed 5 % of all jobs running,
- count of jobs pending doesn't exceed 5 % of all jobs running.
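To make the conditions above concrete, here is a rough sketch of how they might be evaluated once the raw numbers have been collected. Everything in it is an assumption for illustration: the `GridStatus` fields, the `check_grid` function and the host names are hypothetical, and how the numbers are actually gathered from the grid is left open. The 5 % limits are the untested guesses from above, not measured values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GridStatus:
    """Hypothetical snapshot of the grid; the fields are illustrative only."""
    master_alive: bool
    master_threads_in_error: int
    exec_daemons_alive: Dict[str, bool]  # host name -> daemon responding?
    jobs_running: int
    jobs_error: int
    jobs_pending: int


def check_grid(status: GridStatus, max_percent: int = 5) -> List[str]:
    """Return one alarm message per violated condition (empty list = all fine)."""
    alarms = []

    # Master alive and well, with no threads in error state.
    if not status.master_alive:
        alarms.append("master is down")
    elif status.master_threads_in_error > 0:
        alarms.append(
            f"master has {status.master_threads_in_error} thread(s) in error state"
        )

    # Every execution daemon alive and well.
    for host, alive in sorted(status.exec_daemons_alive.items()):
        if not alive:
            alarms.append(f"execution daemon on {host} is down")

    # Integer arithmetic avoids floating-point rounding at the threshold.
    # With zero running jobs, any job in error or pending raises an alarm.
    if status.jobs_error * 100 > max_percent * status.jobs_running:
        alarms.append(f"jobs in error state exceed {max_percent} % of running jobs")
    if status.jobs_pending * 100 > max_percent * status.jobs_running:
        alarms.append(f"pending jobs exceed {max_percent} % of running jobs")

    return alarms
```

A monitoring system could call something like this periodically and page someone whenever the returned list is not empty. Keeping the limit as a parameter makes it easy to adjust once there is real data to back it up.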
As with bug 51434, I think this would be a very good step for improving the reliability of the services we provide -- and getting stats to show it. :-)