Last modified: 2014-09-17 15:29:03 UTC
Rummana saw the issue described in bug 70103 again. The search requests (either in the drop-down on the top right, or within VE) sometimes take a lot longer than normal. Looking at Graphite, I see a weird spike on one of the Elasticsearch boxes: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410819776.01&from=-7days&target=deployment-prep.deployment-elastic01.loadavg.01.value&target=deployment-prep.deployment-elastic02.loadavg.01.value&target=deployment-prep.deployment-elastic03.loadavg.01.value&target=deployment-prep.deployment-elastic04.loadavg.01.value
(Setting priority to Normal for now, but if it starts causing browser test failures or other problems, we'll bump it up.)
Better graph: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410820101.127&from=-6hours&target=deployment-prep.deployment-elastic01.loadavg.01.value&target=deployment-prep.deployment-elastic02.loadavg.01.value&target=deployment-prep.deployment-elastic03.loadavg.01.value&target=deployment-prep.deployment-elastic04.loadavg.01.value
Bah, those graphs are relative-time based (last 6 hours) and will change. Here's a static one for today, 17:30 - 23:30 UTC: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410820186.749&from=17%3A30_20140915&target=deployment-prep.deployment-elastic01.loadavg.01.value&target=deployment-prep.deployment-elastic02.loadavg.01.value&target=deployment-prep.deployment-elastic03.loadavg.01.value&target=deployment-prep.deployment-elastic04.loadavg.01.value&until=23%3A30_20140915
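For anyone who wants to script this rather than bookmark render URLs, here's a minimal sketch (assuming the graphite.wmflabs.org render API is reachable and the deployment-prep metric names above are still valid) that pulls the same load-average series for a fixed window as JSON instead of a PNG:

# Minimal sketch: fetch the load-average series for an absolute time window
# from Graphite's render API as JSON. Host and metric names are taken from
# the URLs above; adjust if they have changed.
import json
import urllib.parse
import urllib.request

GRAPHITE = "http://graphite.wmflabs.org/render/"
targets = ["deployment-prep.deployment-elastic%02d.loadavg.01.value" % i
           for i in range(1, 5)]

params = [("format", "json"),
          ("from", "17:30_20140915"),    # absolute times keep the graph static
          ("until", "23:30_20140915")]
params += [("target", t) for t in targets]

url = GRAPHITE + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as resp:
    series = json.loads(resp.read().decode("utf-8"))

for s in series:
    points = [v for v, _ in s["datapoints"] if v is not None]
    if points:
        print(s["target"], "max load:", max(points))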
Created attachment 16480: Elasticsearch instances load average
Chad / Nik are the best people to investigate Elasticsearch-related issues. Maybe someone imported a bunch of articles on beta, which caused a lot of indexing on the Elasticsearch side.
I can have a look at it soon - yeah. The Elasticsearch cluster in beta isn't designed for performance - just to be there and functional.
Did a bit of digging this morning. Here is a graph of I/O load: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410964216.337&target=deployment-prep.deployment-elastic01.cpu.total.iowait.value&from=-96hours The spike lines up with our slow period. It looks like we saw a spike in the number of queries, but I can't be sure. We keep query counts in Ganglia, but that doesn't seem to be working well today. I'm willing to chalk it up to a spike in requests to beta and intentionally underpowered systems.
Note that Ganglia on labs has been dead for a long time and will remain so for the foreseeable future. Please send metrics to Graphite instead for labs :)
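For reference, pushing a metric to Graphite boils down to one line over Carbon's plaintext protocol; here's a minimal sketch (the hostname and metric path are placeholders/assumptions, and on labs the existing per-instance metrics collector normally handles this rather than hand-rolled code):

# Minimal sketch: send one data point to a Graphite Carbon plaintext
# listener. Hostname and metric path are hypothetical, not the actual
# labs configuration; 2003 is Carbon's default plaintext port.
import socket
import time

CARBON_HOST = "carbon.example.wmflabs"   # hypothetical endpoint
CARBON_PORT = 2003

def send_metric(path, value, timestamp=None):
    timestamp = int(timestamp or time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# e.g. report this instance's 1-minute load average
load1 = open("/proc/loadavg").read().split()[0]
send_metric("deployment-prep.myinstance.loadavg.01.value", load1)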
(In reply to Nik Everett from comment #7)
> I'm willing to chalk it up to a spike in requests to beta and intentionally
> underpowered systems.

I just want to underline this. "Intentionally underpowered" is so that glitches like this get noticed and investigated. Sometimes, as seems to be the case here, the investigation turns up nothing much, but the underpowered nature of beta labs often triggers real problems that would be much more drastic at production scale. Thanks Rummana, thanks Nik...
In this case I'm kind of blind because of the lack of Ganglia - it's really a shame that we don't have it working and that no one has found the time to port the Ganglia monitoring to Graphite. Maybe relevant: I see request spikes in production that don't translate into huge load spikes because we use the pool counter to prevent it. I don't believe beta has the pool counter configured at all.
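For context, PoolCounter is MediaWiki's shared concurrency-limit service; in production it caps how many expensive search queries run at once, so a request spike queues or sheds load instead of flattening the Elasticsearch boxes. The idea, stripped down to a sketch (a plain in-process semaphore in Python - an illustration only, not MediaWiki's actual API or configuration):

# Conceptual sketch only: cap how many searches run concurrently so a
# request spike fails fast instead of overloading the search backend.
# MediaWiki's real PoolCounter is a separate network service configured
# via $wgPoolCounterConf; this just shows the idea in miniature.
import threading

MAX_CONCURRENT_SEARCHES = 15   # made-up limit for illustration
ACQUIRE_TIMEOUT = 5            # seconds to wait before giving up

_pool = threading.BoundedSemaphore(MAX_CONCURRENT_SEARCHES)

def run_search(query, do_search):
    if not _pool.acquire(timeout=ACQUIRE_TIMEOUT):
        # Too many searches in flight: shed load rather than pile it up.
        raise RuntimeError("search pool exhausted, try again later")
    try:
        return do_search(query)
    finally:
        _pool.release()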
(In reply to Nik Everett from comment #10)
> Maybe relevant: I see request spikes in production that don't translate into
> huge load spikes because we use the pool counter to prevent it. I don't
> believe beta has the pool counter configured at all.

At all as in? (What are the next steps to put that in place? Please file bugs :) )