Last modified: 2014-09-17 15:29:03 UTC
Rummana saw the issue described in bug 70103 again. The search requests (either in the drop-down on the top right, or within VE) sometimes take a lot longer than normal. Looking at Graphite, I see a weird spike on one of the Elasticsearch boxes: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410819776.01&from=-7days&target=deployment-prep.deployment-elastic01.loadavg.01.value&target=deployment-prep.deployment-elastic02.loadavg.01.value&target=deployment-prep.deployment-elastic03.loadavg.01.value&target=deployment-prep.deployment-elastic04.loadavg.01.value
(Setting priority to Normal for now, but if it starts causing browser test failures or other problems, we'll bump it up.)
Better graph: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410820101.127&from=-6hours&target=deployment-prep.deployment-elastic01.loadavg.01.value&target=deployment-prep.deployment-elastic02.loadavg.01.value&target=deployment-prep.deployment-elastic03.loadavg.01.value&target=deployment-prep.deployment-elastic04.loadavg.01.value
Bah, those graphs are relative-time based (last 6 hours) and will change. Here's a static one for today, 17:30 - 23:30 UTC: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410820186.749&from=17%3A30_20140915&target=deployment-prep.deployment-elastic01.loadavg.01.value&target=deployment-prep.deployment-elastic02.loadavg.01.value&target=deployment-prep.deployment-elastic03.loadavg.01.value&target=deployment-prep.deployment-elastic04.loadavg.01.value&until=23%3A30_20140915
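For anyone who wants to script this rather than bookmark render URLs, here's a minimal sketch (assuming the graphite.wmflabs.org render API is reachable and the deployment-prep metric names above are still valid) that pulls the same load-average series for a fixed window as JSON instead of a PNG:

# Minimal sketch: fetch the load-average series for an absolute time window
# from Graphite's render API as JSON. Host and metric names are taken from
# the URLs above; adjust if they have changed.
import json
import urllib.parse
import urllib.request

GRAPHITE = "http://graphite.wmflabs.org/render/"
targets = ["deployment-prep.deployment-elastic%02d.loadavg.01.value" % i
           for i in range(1, 5)]

params = [("format", "json"),
          ("from", "17:30_20140915"),    # absolute times keep the graph static
          ("until", "23:30_20140915")]
params += [("target", t) for t in targets]

url = GRAPHITE + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as resp:
    series = json.loads(resp.read().decode("utf-8"))

for s in series:
    points = [v for v, _ in s["datapoints"] if v is not None]
    if points:
        print(s["target"], "max load:", max(points))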
Created attachment 16480: Elasticsearch instances load average
Chad / Nik are the best people to investigate Elasticsearch-related issues. Maybe someone imported a bunch of articles on beta, which caused a lot of indexing on the Elasticsearch side.
I can have a look at it soon - yeah. The Elasticsearch cluster in beta isn't designed for performance - just to be there and functional.
Did a bit of digging this morning. Here is a graph of I/O load: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410964216.337&target=deployment-prep.deployment-elastic01.cpu.total.iowait.value&from=-96hours The spike lines up with our slow period. It looks like we saw a spike in the number of queries, but I can't be sure. We keep query counts in Ganglia, but that doesn't seem to be working well today. I'm willing to chalk it up to a spike in requests to beta and intentionally underpowered systems.
Note that Ganglia on labs has been dead for a long time and will remain so for the foreseeable future. Please send metrics to Graphite instead for labs :)
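For reference, pushing a metric to Graphite boils down to one line over Carbon's plaintext protocol; here's a minimal sketch (the hostname and metric path are placeholders/assumptions, and on labs the existing per-instance metrics collector normally handles this rather than hand-rolled code):

# Minimal sketch: send one data point to a Graphite Carbon plaintext
# listener. Hostname and metric path are hypothetical, not the actual
# labs configuration; 2003 is Carbon's default plaintext port.
import socket
import time

CARBON_HOST = "carbon.example.wmflabs"   # hypothetical endpoint
CARBON_PORT = 2003

def send_metric(path, value, timestamp=None):
    timestamp = int(timestamp or time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# e.g. report this instance's 1-minute load average
load1 = open("/proc/loadavg").read().split()[0]
send_metric("deployment-prep.myinstance.loadavg.01.value", load1)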
(In reply to Nik Everett from comment #7)
> I'm willing to chalk it up to a spike in requests to beta and intentionally
> underpowered systems.

I just want to underline this. "Intentionally underpowered" is so that glitches like this get noticed and investigated. Sometimes, as seems to be the case here, the investigation turns up nothing much, but the underpowered nature of beta labs often triggers real problems that would be much more drastic at production scale. Thanks Rummana, thanks Nik...
In this case I'm kind of blind because of the lack of Ganglia - it's really a shame that we don't have it working and that no one has found the time to port the Ganglia monitoring to Graphite. Maybe relevant: I see request spikes in production that don't translate into huge load spikes because we use the pool counter to prevent it. I don't believe beta has the pool counter configured at all.
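For context, PoolCounter is MediaWiki's shared concurrency-limit service; in production it caps how many expensive search queries run at once, so a request spike queues or sheds load instead of flattening the Elasticsearch boxes. The idea, stripped down to a sketch (a plain in-process semaphore in Python - an illustration only, not MediaWiki's actual API or configuration):

# Conceptual sketch only: cap how many searches run concurrently so a
# request spike fails fast instead of overloading the search backend.
# MediaWiki's real PoolCounter is a separate network service configured
# via $wgPoolCounterConf; this just shows the idea in miniature.
import threading

MAX_CONCURRENT_SEARCHES = 15   # made-up limit for illustration
ACQUIRE_TIMEOUT = 5            # seconds to wait before giving up

_pool = threading.BoundedSemaphore(MAX_CONCURRENT_SEARCHES)

def run_search(query, do_search):
    if not _pool.acquire(timeout=ACQUIRE_TIMEOUT):
        # Too many searches in flight: shed load rather than pile it up.
        raise RuntimeError("search pool exhausted, try again later")
    try:
        return do_search(query)
    finally:
        _pool.release()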
(In reply to Nik Everett from comment #10)
> Maybe relevant: I see request spikes in production that don't translate into
> huge load spikes because we use the pool counter to prevent it. I don't
> believe beta has the pool counter configured at all.

At all as in? (What are the next steps to put that in place? Please file bugs :) )