Last modified: 2014-09-17 15:29:03 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and links other than those displaying bug reports and their history may be broken. See T72869, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 70869 - Search is sometimes slow on the Beta Cluster
Status: NEW
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta)
Version: unspecified
Hardware/OS: All / All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2014-09-15 22:26 UTC by Greg Grossmeier
Modified: 2014-09-17 15:29 UTC
CC: 12 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Elastic search instances load average (21.62 KB, image/png)
2014-09-16 07:37 UTC, Antoine "hashar" Musso (WMF)

Description Greg Grossmeier 2014-09-15 22:26:14 UTC
Rummana saw the issue described in bug 70103 again.

The search requests (either in the drop down on the top right, or within VE) are sometimes taking a lot longer than normal.

Looking at Graphite I see a weird spike on one of the Elasticsearch boxes:
http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410819776.01&from=-7days&target=deployment-prep.deployment-elastic01.loadavg.01.value&target=deployment-prep.deployment-elastic02.loadavg.01.value&target=deployment-prep.deployment-elastic03.loadavg.01.value&target=deployment-prep.deployment-elastic04.loadavg.01.value
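
A minimal sketch of pulling those same series from Graphite's render API as JSON rather than a PNG, assuming the standard format=json parameter and that the graphite.wmflabs.org endpoint still answers (the archived instance may not):

    # Fetch the four Elasticsearch instances' load averages from the Graphite
    # render API as JSON (the endpoint and metric paths are taken from the
    # URL above; format=json is standard Graphite).
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    GRAPHITE = "http://graphite.wmflabs.org/render/"
    targets = [
        "deployment-prep.deployment-elastic%02d.loadavg.01.value" % i
        for i in range(1, 5)
    ]
    params = [("target", t) for t in targets] + [("from", "-7days"), ("format", "json")]

    with urlopen(GRAPHITE + "?" + urlencode(params)) as resp:
        series = json.load(resp)

    # Each entry looks like {"target": ..., "datapoints": [[value, timestamp], ...]}.
    for s in series:
        values = [v for v, _ in s["datapoints"] if v is not None]
        if values:
            print(s["target"], "max 1-min load:", max(values))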
Comment 1 Greg Grossmeier 2014-09-15 22:26:45 UTC
(setting normal for now, but if it starts causing browser test failures or otherwise, we'll bump it up)
Comment 4 Antoine "hashar" Musso (WMF) 2014-09-16 07:37:30 UTC
Created attachment 16480 [details]
Elastic search instances load average
Comment 5 Antoine "hashar" Musso (WMF) 2014-09-16 16:34:30 UTC
Chad / Nik are the best points of contact to investigate ElasticSearch-related issues.  Maybe someone imported a bunch of articles on beta, which caused a lot of indexing on the ElasticSearch side.
Comment 6 Nik Everett 2014-09-16 16:37:53 UTC
I can have a look at it soon - yeah.  The Elasticsearch cluster in beta isn't designed for performance - just to be there and functional.
Comment 7 Nik Everett 2014-09-17 14:39:43 UTC
Did a bit of digging this morning.  Here is a graph of I/O load:
http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410964216.337&target=deployment-prep.deployment-elastic01.cpu.total.iowait.value&from=-96hours

The spike is our slow time.  It looks like we saw a spike in the number of queries but I can't be sure.  We keep query counts in ganglia but that doesn't seem to be working well today.

I'm willing to chalk it up to a spike in requests to beta and intentionally underpowered systems.
Comment 8 Yuvi Panda 2014-09-17 15:00:11 UTC
Note that ganglia on labs has been dead for a long time, and will remain so for the foreseeable future. Do send metrics to graphite instead for labs :)
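
For concreteness, here is a minimal sketch of what sending a metric to Graphite looks like over the plaintext carbon protocol (one "path value timestamp" line per datapoint on TCP port 2003); the host name and metric path below are placeholders, not the actual labs endpoint:

    # Minimal sketch of Graphite's plaintext carbon protocol. CARBON_HOST is a
    # placeholder; this bug does not name the actual labs ingestion endpoint.
    import socket
    import time

    CARBON_HOST = "carbon.example.org"  # hypothetical
    CARBON_PORT = 2003                  # default carbon plaintext port

    def send_metric(path, value, timestamp=None):
        line = "%s %s %d\n" % (path, value, timestamp or int(time.time()))
        with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    # e.g. a hypothetical per-instance search query counter
    send_metric("deployment-prep.deployment-elastic01.search.query_count", 42)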
Comment 9 Chris McMahon 2014-09-17 15:11:34 UTC
(In reply to Nik Everett from comment #7)

> I'm willing to chalk it up to a spike in requests to beta and intentionally
> underpowered systems.

I just want to underline this.  "Intentionally underpowered" is so that glitches like this will be noticed and investigated. 

Sometimes, like here it seems, the investigation turns up nothing much, but the underpowered nature of beta labs often triggers real problems that would be much more drastic at production scale.

Thanks Rummana, thanks Nik...
Comment 10 Nik Everett 2014-09-17 15:17:22 UTC
In this case I'm kind of blind because of the lack of Ganglia - it's really a shame that we don't have it working and/or that no one has found the time to port the Ganglia monitoring to Graphite.

Maybe relevant: I see request spikes in production that don't translate into huge load spikes because we use the pool counter to prevent it.  I don't believe beta has the pool counter configured at all.
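
For readers unfamiliar with it: the pool counter caps how many expensive requests may run concurrently, so a request spike queues or fails fast instead of becoming a load spike. A conceptual sketch of that idea follows; it is not MediaWiki's actual PoolCounter service or its configuration:

    # Conceptual sketch only - not MediaWiki's PoolCounter implementation.
    # A bounded pool of worker slots caps concurrent expensive searches.
    import threading

    class SearchPool:
        def __init__(self, workers=10, timeout=5.0):
            self._slots = threading.Semaphore(workers)
            self._timeout = timeout

        def run(self, search_fn, *args, **kwargs):
            # Wait up to `timeout` seconds for a free slot, else reject the
            # request instead of piling more load onto the backend.
            if not self._slots.acquire(timeout=self._timeout):
                raise RuntimeError("search pool full, rejecting request")
            try:
                return search_fn(*args, **kwargs)
            finally:
                self._slots.release()

    # Usage (hypothetical): pool = SearchPool(workers=10)
    #                       pool.run(run_elasticsearch_query, "some query")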
Comment 11 Greg Grossmeier 2014-09-17 15:22:40 UTC
(In reply to Nik Everett from comment #10)
> Maybe relevant: I see request spikes in production that don't translate into
> huge load spikes because we use the pool counter to prevent it.  I don't
> believe beta has the pool counter configured at all.

At all as in? (What are the next steps to put that in place? Please file bugs :) )
