Last modified: 2014-10-01 20:24:32 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T73260, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 71260 - Speed up health check.


Summary:	Speed up health check.

Status:	RESOLVED FIXED

Product:	OCG
Classification:	Unclassified
Component:	General/Unknown
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized normal
Target Milestone:	---
Assigned To:	C. Scott Ananian

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-09-24 23:07 UTC by C. Scott Ananian
Modified:	2014-10-01 20:24 UTC
CC List:	1 user

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description C. Scott Ananian 2014-09-24 23:07:45 UTC

icinga warning:
icinga-wm: PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds  

The port is responsive interactively; it looks like the timeout is just a bit too short for what the health check is trying to do.

In particular, the health check does a `du -s` of several cache directories, one of which is >6G now.  (The icinga limit for that directory is 40G.)  The latest deploy included a change which partially serialized that `du` (Ie359701c6972cd49786ffde1e8be1cb64d356fa2), which might be why we have recently started to brush up against the timeout.

We should improve the speed of the health check.  Probably the best way to do this is to cache the sizes of the directories or do the `du` step less frequently.  Alternatively we could add a `quick` check that skips the cache size step.  Re-adding some of the parallelism to the `du` might help somewhat, but probably not enough to cover us when the cache directory climbs nearer its 40G limit.
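
A minimal sketch of the "cache the sizes of the directories" idea, in Python for illustration only. This is not the OCG code and not the patch that later landed; `directory_size`, `CachedDirectorySize`, and the `ttl_seconds` default are hypothetical names and values chosen for this example. It sums apparent file sizes rather than the disk blocks that `du -s` reports, so the numbers would only approximate `du`.

```python
import os
import time


def directory_size(path):
    """Sum the apparent sizes (bytes) of files under path.

    Rough stand-in for `du -s`; real `du` counts disk blocks instead.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return total


class CachedDirectorySize:
    """Hypothetical helper: re-measure a directory at most once per TTL."""

    def __init__(self, path, ttl_seconds=300.0, clock=time.monotonic,
                 measure=directory_size):
        self._path = path
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the expiry can be tested
        self._measure = measure    # the expensive walk
        self._value = None
        self._measured_at = 0.0

    def get(self):
        """Return the cached size, walking the directory only when stale."""
        now = self._clock()
        if self._value is None or now - self._measured_at >= self._ttl:
            self._value = self._measure(self._path)
            self._measured_at = now
        return self._value
```

With something like this, a health check would call `get()` on every request and pay for the full directory walk only once per TTL window, at the cost of reporting a size that can be up to `ttl_seconds` out of date.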

Comment 1 Gerrit Notification Bot 2014-09-25 17:59:28 UTC

Change 162933 had a related patch set uploaded by Cscott:
Speed up/cache directory size computation in health check.

https://gerrit.wikimedia.org/r/162933

Comment 2 Gerrit Notification Bot 2014-09-26 03:48:38 UTC

Change 162933 merged by jenkins-bot:
Speed up/cache directory size computation in health check.

https://gerrit.wikimedia.org/r/162933

Comment 3 C. Scott Ananian 2014-09-26 16:24:08 UTC

Landed the above patch to cache the directory sizes. From local testing I expected it to cut the time taken by a (cached) health check to ~30ms, but I'm still seeing ~7s request times.  So I think there's still an issue here.

Comment 4 Gerrit Notification Bot 2014-09-26 16:24:52 UTC

Change 163186 had a related patch set uploaded by Cscott:
Increase OCG warning/critical space thresholds.

https://gerrit.wikimedia.org/r/163186

Comment 5 Gerrit Notification Bot 2014-09-26 18:50:54 UTC

Change 163186 merged by BBlack:
Increase OCG warning/critical space thresholds.

https://gerrit.wikimedia.org/r/163186

Comment 6 C. Scott Ananian 2014-10-01 20:24:32 UTC

Fixed with https://gerrit.wikimedia.org/r/163997
