Last modified: 2014-07-10 18:40:11 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T69817, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 67817 - Monitor for anomalies/spikes in read failures of memcached
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: wmf-deployment
Hardware: All All
Importance: Normal normal
Assigned To: Nobody
Reported: 2014-07-10 18:28 UTC by Greg Grossmeier
Modified: 2014-07-10 18:40 UTC (History)


Description Greg Grossmeier 2014-07-10 18:28:29 UTC
(Came out of https://wikitech.wikimedia.org/wiki/Incident_documentation/20140517-bits )

Discussion:

Timo: In retrospect, we had data in logstash clearly indicating a massive increase in read failures from this memcached instance (basically from < 1% to nearly 100%). This could and should be monitored by icinga and reported to ops automatically, which would have helped us catch it much earlier.
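
A rough sketch of the kind of check described above, assuming the failure and total read counts for a recent time window are already available from somewhere (logstash, graphite, or memcached itself). The function name and the warn/crit thresholds are hypothetical and only for illustration; the exit codes follow the usual Nagios/Icinga plugin convention.

```python
# Nagios/Icinga plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_read_failures(failures, total, warn=0.05, crit=0.25):
    """Classify a memcached read failure ratio for one time window.

    The warn/crit thresholds are illustrative assumptions, not values
    taken from the incident report.
    """
    if total <= 0:
        # No reads at all is suspicious in itself, but not a known failure.
        return UNKNOWN, "UNKNOWN: no memcached reads observed"

    ratio = failures / total
    if ratio >= crit:
        state, label = CRITICAL, "CRITICAL"
    elif ratio >= warn:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    return state, f"{label}: memcached read failure ratio {ratio:.1%}"
```

With thresholds like these, the normal state (< 1% failures) would report OK and the state seen during the incident (nearly 100%) would report CRITICAL. A real plugin would print the message and exit with the returned state.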

Chat from IRC on 2014-07-09:
[10:16]  <Krinkle>  For something that is logged in logstash (e.g. memcached errors). What is the strategy you'd typically take to monitor it in icinga? Is there a step in between or would you actually have icinga use logstash?
[10:16]  <Krinkle>  I think the latter should be possible for more complex queries or aggregated data. Though I reckon in case of memcached there's probably a more direct approach possible.
[10:18]  <bd808>  Good question. Logstash by itself can do point-in-time monitoring, but it really has no useful way to alert on trends itself. 
[10:19]  <Krinkle>  I think most critical things should probably be polled by icinga directly. But on multiple occasions I have used logstash to quite easily pinpoint where an error came from. And it'd be useful to have those trends also result in pings to ops (perhaps not as critical via text but at least an irc ping would be useful).
[10:20]  <bd808>  One way™ to do it would be to graph trends in graphite driven by counts made by logstash and alert with icinga when the trend does something.
[10:20]  <Krinkle>  Right now logstash is mostly polling and digging manually, after the fact. That's immensely useful and it's good at that. But I think it has more potential.
[10:20]  <Krinkle>  Ah, I see. So it'd go to graphite after logstash. Interesting.
[10:20]  <bd808>  We aren't doing it now, but logstash can feed graphite in a statsd fashion
[10:20]  <Krinkle>  Right. 
[10:21]  <Krinkle>  For some reason I thought they might also be able to feed graphite from the source that feeds logstash.
[10:21]  <Krinkle>  I guess that's still possible, unless the source is distributed (or if the query is more advanced). In which case using logstash in between makes sense
[10:22]  * bd808 nods
[10:22]  <Krinkle>  cool
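
The logstash -> graphite -> icinga idea from the chat could be sketched roughly as follows. This is only an illustration under assumptions: the metric name is made up, and the spike rule (latest datapoint compared against the median of the preceding ones) is one simple option, not something agreed on in the discussion. The counter line uses the plain statsd text format (`name:value|c`).

```python
from statistics import median


def statsd_counter(metric, count=1):
    """Format a statsd counter line, e.g. for each batch of logged errors."""
    return f"{metric}:{count}|c"


def is_spike(series, factor=5.0, floor=1.0):
    """Return True if the latest datapoint is a spike relative to its history.

    series: per-interval failure counts, oldest first (as graphite would
    return them). factor and floor are illustrative assumptions.
    """
    if len(series) < 2:
        return False  # not enough history to establish a baseline
    *history, latest = series
    # Use a floor so a quiet baseline of 0 does not flag every single error.
    baseline = max(median(history), floor)
    return latest >= factor * baseline


# Hypothetical metric name, for illustration only.
line = statsd_counter("memcached.read_failure", 3)
```

An icinga check built on this would fetch the recent datapoints for the counter from graphite and alert when `is_spike` returns True.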
