Last modified: 2014-10-01 20:24:32 UTC
icinga warning: icinga-wm: PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
The port is responsive interactively; it looks like the timeout is just a bit too short for what the health check is trying to do. In particular, the health check does a `du -s` of several cache directories, one of which is >6G now. (The icinga limit for that directory is 40G.) The latest deploy included a change which partially serialized that `du` (Ie359701c6972cd49786ffde1e8be1cb64d356fa2), which might be why we have recently started brushing up against the timeout.
We should improve the speed of the health check. Probably the best way to do this is to cache the sizes of the directories or do the `du` step less frequently. Alternatively, we could add a `quick` check which doesn't include the cache size step. Re-adding some of the parallelism to the `du` might help some, but probably not enough to cover us when the cache directory climbs nearer its 40G limit.
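To illustrate the caching idea, here is a minimal sketch in Python of a time-based cache for directory sizes. It is not the code from any of the changes on this ticket, and the names `directory_size`, `DirSizeCache`, and `ttl_seconds` are hypothetical. It also sums apparent file sizes in bytes rather than disk blocks, so the numbers will not exactly match `du -s`.

```python
import os
import time


def directory_size(path):
    """Sum the apparent sizes (bytes) of files under path.

    Hypothetical stand-in for the `du -s` step; unlike `du`, it counts
    file lengths rather than allocated disk blocks.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file disappeared between listing and stat; skip it.
                pass
    return total


class DirSizeCache:
    """Remember each directory's size for ttl_seconds before re-measuring."""

    def __init__(self, ttl_seconds=300.0, measure=directory_size,
                 clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._measure = measure   # the expensive size computation
        self._clock = clock       # injectable so the TTL can be tested
        self._entries = {}        # path -> (time measured, size)

    def size(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            # Still fresh: answer without touching the filesystem.
            return entry[1]
        value = self._measure(path)
        self._entries[path] = (now, value)
        return value
```

With something like this, only the first health check in each TTL window pays for the directory walk; the rest return the remembered value. The trade-off is that the reported size can be stale by up to `ttl_seconds`, which seems acceptable when the alert threshold is far above the current usage.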
Change 162933 had a related patch set uploaded by Cscott: Speed up/cache directory size computation in health check. https://gerrit.wikimedia.org/r/162933
Change 162933 merged by jenkins-bot: Speed up/cache directory size computation in health check. https://gerrit.wikimedia.org/r/162933
Landed the above patch to cache the directory sizes. From local testing I expected it to bring a (cached) health check down to ~30ms, but I'm still seeing ~7s request times, so I think there's still an issue here.
Change 163186 had a related patch set uploaded by Cscott: Increase OCG warning/critical space thresholds. https://gerrit.wikimedia.org/r/163186
Change 163186 merged by BBlack: Increase OCG warning/critical space thresholds. https://gerrit.wikimedia.org/r/163186
Fixed with https://gerrit.wikimedia.org/r/163997