Last modified: 2014-10-01 20:24:32 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T73260, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 71260 - Speed up health check.
Status: RESOLVED FIXED
Product: OCG
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: C. Scott Ananian
Reported: 2014-09-24 23:07 UTC by C. Scott Ananian
Modified: 2014-10-01 20:24 UTC (History)

Description C. Scott Ananian 2014-09-24 23:07:45 UTC
icinga warning:
icinga-wm: PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds  

The port is responsive interactively; it looks like the timeout is just a bit too short for what the health check is trying to do.

In particular, the health check does a `du -s` of several cache directories, one of which is >6G now.  (The icinga limit for that directory is 40G.)  The latest deploy included a change which partially serialized that `du` (Ie359701c6972cd49786ffde1e8be1cb64d356fa2), which might be why we have recently started brushing up against the timeout.

We should improve the speed of the health check.  Probably the best way to do this is to cache the sizes of the directories or do the `du` step less frequently.  Alternatively we could add a `quick` check which didn't include the cache size step.  Re-adding some of the parallelism to the `du` might help some, but probably not enough to cover us when the cache directory climbs nearer its 40G limit.
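
A minimal sketch of the "cache the sizes" idea, for illustration only. It is written in Python rather than in OCG's own code, and every name in it (`dir_size_bytes`, `CachedDirSize`, the `ttl` parameter) is hypothetical and not taken from the actual patch. It assumes that a directory size which is a few minutes stale is acceptable for the health check, and it sums apparent file sizes, which only approximates what `du -s` reports.

```python
import os
import time


def dir_size_bytes(path):
    """Walk `path` and sum file sizes (a rough stand-in for `du -s`)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return total


class CachedDirSize:
    """Remember each directory's size for `ttl` seconds (hypothetical helper)."""

    def __init__(self, ttl=300.0, measure=dir_size_bytes, clock=time.monotonic):
        self.ttl = ttl
        self.measure = measure
        self.clock = clock
        self._cache = {}  # path -> (time measured, size in bytes)

    def get(self, path):
        now = self.clock()
        hit = self._cache.get(path)
        if hit is not None and now - hit[0] < self.ttl:
            # Fresh enough: answer without touching the disk.
            return hit[1]
        size = self.measure(path)
        self._cache[path] = (now, size)
        return size
```

With something like this, only the first health check after the cached value expires pays for the directory walk; the others return the remembered value. A `quick` check could skip the lookup entirely.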
Comment 1 Gerrit Notification Bot 2014-09-25 17:59:28 UTC
Change 162933 had a related patch set uploaded by Cscott:
Speed up/cache directory size computation in health check.

https://gerrit.wikimedia.org/r/162933
Comment 2 Gerrit Notification Bot 2014-09-26 03:48:38 UTC
Change 162933 merged by jenkins-bot:
Speed up/cache directory size computation in health check.

https://gerrit.wikimedia.org/r/162933
Comment 3 C. Scott Ananian 2014-09-26 16:24:08 UTC
Landed the above patch to cache the directory sizes. From local testing I expected it to cut the time taken for a (cached) health check to ~30ms, but I'm still seeing ~7s request times, so I think there's still an issue here.
Comment 4 Gerrit Notification Bot 2014-09-26 16:24:52 UTC
Change 163186 had a related patch set uploaded by Cscott:
Increase OCG warning/critical space thresholds.

https://gerrit.wikimedia.org/r/163186
Comment 5 Gerrit Notification Bot 2014-09-26 18:50:54 UTC
Change 163186 merged by BBlack:
Increase OCG warning/critical space thresholds.

https://gerrit.wikimedia.org/r/163186
Comment 6 C. Scott Ananian 2014-10-01 20:24:32 UTC
Fixed with https://gerrit.wikimedia.org/r/163997
