Last modified: 2014-04-07 19:35:17 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T51757, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 49757 - Better monitoring and error reporting of Errors and Exceptions
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Whiteboard: deploysprint-13
Keywords: ops
Depends on:
Blocks:
Reported: 2013-06-18 17:20 UTC by Greg Grossmeier
Modified: 2014-04-07 19:35 UTC (History)

Description Greg Grossmeier 2013-06-18 17:20:59 UTC
As stated by Ori after a nasty account creation bug (bug 49727):

"Errors and exceptions are currently broadcast to fluorine and vanadium via
UDP. I have code that parses the stream and generates the Ganglia graphs,
but it isn't hooked up to Icinga or any other form of monitoring. Would
anyone from ops want to pair up with me on this?"

Let's do this.
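
As a rough illustration of what hooking the error stream up to Icinga could involve, here is a minimal sketch of a check that counts recent errors and maps the count to the standard Nagios/Icinga plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The function name, the five-minute window and the thresholds are assumptions made for this example; they are not taken from the existing parsing code.

```python
from datetime import datetime, timedelta

# Exit codes defined by the Nagios/Icinga plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def check_error_rate(timestamps, now, window=timedelta(minutes=5),
                     warn=10, crit=50):
    """Return (exit_code, status_line) for errors seen within `window`.

    `warn` and `crit` are hypothetical thresholds, not production values.
    """
    # Count only the errors whose timestamp falls inside the window.
    recent = sum(1 for ts in timestamps if now - window <= ts <= now)
    if recent >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif recent >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, f"{label} - {recent} errors in the last {window}"
```

A wrapper script would print the status line and exit with the returned code, which is how Icinga picks up a plugin's result.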
Comment 2 Greg Grossmeier 2013-07-22 21:44:11 UTC
Ori: Do you have something started that you can share on this bug? A project page or anything?

hashar suggested you might log the fatals to a db so that we could enlist Analytics to work on a real dashboard for it.
Comment 3 Ori Livneh 2013-07-23 10:24:17 UTC
(In reply to comment #2)
> Ori: Do you have something started that you can share on this bug? A project
> page or anything?

Not yet, but close. I need another day or two.
 
> hashar suggested you might log the fatals to a db so that we could enlist
> Analytics to work on a real dashboard for it.

Yes; it's a good idea :)
Comment 4 Gerrit Notification Bot 2013-07-24 08:48:50 UTC
Change 75560 had a related patch set uploaded by Ori.livneh:
(WIP) Parse errors and write to MongoDB

https://gerrit.wikimedia.org/r/75560
Comment 5 Ori Livneh 2013-07-25 07:57:46 UTC
Some notes about how things are currently configured:

MediaWiki can report errors to a remote host via UDP. The MediaWiki instances on the production cluster are configured to log to a host named 'fluorine'. This is done by specifying its address as the value of $wmfUdp2logDest in CommonSettings.php (in operations/mediawiki-config.git).

The MediaWiki instances that power the beta cluster set $wmfUdp2logDest to 'deployment-bastion' (<https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000390>), a Labs instance which plays the role of fluorine. It writes log data to files in /home/wikipedia/logs. Exceptions and fatals are respectively logged to exception.log and fatal.log in that directory.

When I first started looking at these logs, I didn't want to mess with the file-based logging, since it's an important service that developers rely on. So I submitted a patch to have fluorine stream the log data to another host (vanadium) as it receives it, in addition to writing it to disk. On vanadium I have a script that generates the Ganglia graphs at <http://ur1.ca/edq1f>.
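
To make the "write to disk and also forward a copy" arrangement concrete, here is a minimal sketch in Python. It is not the actual patch applied to fluorine; the function names and the idea of passing the forwarder in as a callable are assumptions made for illustration.

```python
import socket


def handle_datagram(data, log_file, forward):
    """Append one datagram to the log and hand a copy to `forward`."""
    log_file.write(data)
    if not data.endswith(b"\n"):
        log_file.write(b"\n")  # keep one log entry per line
    log_file.flush()
    forward(data)  # the copy is sent on unchanged


def serve(listen_addr, log_path, forward_addr):
    """Receive UDP log datagrams forever, writing and forwarding each one."""
    recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv_sock.bind(listen_addr)
    send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    with open(log_path, "ab") as log_file:
        while True:
            data, _sender = recv_sock.recvfrom(65535)
            handle_datagram(data, log_file,
                            lambda d: send_sock.sendto(d, forward_addr))
```

Calling `serve(("0.0.0.0", 8420), "/tmp/exception.log", ("127.0.0.1", 8421))` would start the relay; the ports and path here are placeholders, not the production values. Keeping the file write ahead of the forward means a failure on the second host cannot prevent an entry from reaching disk.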

Yesterday I submitted change Ia0cc8de43 and Ryan merged it. That change reproduces the state of affairs described above (i.e. the duplication of the log stream to two destinations, fluorine and vanadium) on the beta cluster. It does so by having deployment-bastion forward a copy of the log data to a new instance, deployment-fluoride (<https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000084c>).

So the TL;DR is that there is an instance on the beta cluster (deployment-fluoride) that receives a live stream of errors and fatals being generated on the beta cluster MediaWikis, and we're free to use it as a sandbox for trying out different ways of capturing and representing this data.

I've only taken an initial step, which is to take the stream of exceptions and fatals (which follow an idiosyncratic format that is not easy to analyze) and transform each error report into a JSON document. This is the work done in Ia0cc8de43 (<https://gerrit.wikimedia.org/r/#/c/75560/>). Or "half-done", I should say, since I've discovered a couple of bugs that I haven't yet had a chance to fix.
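
The transformation step might look roughly like the sketch below. The real MediaWiki exception and fatal formats are not reproduced here: the line layout in the comment is invented purely for this example, as are the field names, so this shows the shape of the approach rather than the code in the Gerrit change.

```python
import json
import re

# Hypothetical line layout, invented for this sketch:
#   "<YYYY-MM-DD HH:MM:SS> <host> <wiki>: <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<wiki>\S+): (?P<message>.*)$"
)


def to_json(line, kind):
    """Convert one log line to a JSON document, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None  # leave unparseable lines for the caller to handle
    record = match.groupdict()
    record["type"] = kind  # e.g. "exception" or "fatal"
    return json.dumps(record, sort_keys=True)
```

Returning None for lines that do not match lets the caller decide whether to drop them or log them separately, which matters when the input format has irregular entries.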

The nice thing about JSON is that most modern languages have modules for handling it in their standard library. So the status quo is that, pending a couple of bugfixes, there will shortly be a streaming JSON service on deployment-fluoride that publishes MediaWiki error and exception reports as machine-readable objects.

In this state, the logs are quite easy to pipe into a data store or a visualization framework. We have to figure out what exactly we want to do, though, and then spec out some solution, ideally using solid off-the-shelf solutions where such solutions exist.
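
As one illustration of piping such a stream into a data store, the sketch below loads newline-delimited JSON reports into SQLite and runs a simple aggregate. SQLite is only a stand-in chosen because it ships with Python; the table layout and the `timestamp`, `type` and `message` field names are assumptions, not a description of the actual JSON documents.

```python
import json
import sqlite3


def store_reports(conn, json_lines):
    """Insert newline-delimited JSON error reports into an `errors` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS errors "
        "(timestamp TEXT, type TEXT, message TEXT)"
    )
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between documents
        doc = json.loads(line)
        rows.append((doc.get("timestamp"), doc.get("type"), doc.get("message")))
    conn.executemany("INSERT INTO errors VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def counts_by_type(conn):
    """Return {type: count}, the kind of aggregate a dashboard might plot."""
    cursor = conn.execute("SELECT type, COUNT(*) FROM errors GROUP BY type")
    return dict(cursor.fetchall())
```

With `conn = sqlite3.connect(":memory:")`, any iterable of JSON lines (a file, a socket wrapper) can be passed to `store_reports`, and `counts_by_type` then gives per-type totals.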

Some ideas to get the ball rolling:
https://getsentry.com/welcome/ (packages itself as a paid service, but the software is open-source).
http://logstash.net/

We could also build our own custom UI for spelunking the data.
Comment 6 Antoine "hashar" Musso (WMF) 2013-07-25 14:02:41 UTC
See also bug 52026 about documenting on wikitech our fatal/exception stuff
Comment 7 Ori Livneh 2013-11-13 10:40:53 UTC
This kind of bug is difficult to close because there's no clear criterion for considering it resolved. The 'exception-json' log bucket on fluorine got enabled today, so let's pick that as an arbitrary marker and mark this resolved, even though we clearly need to do way more work on logging.
Comment 8 MZMcBride 2013-11-14 00:07:30 UTC
(In reply to comment #7)
> This kind of bug is difficult to close because there's no clear criterion for
> considering it resolved. The 'exception-json' log bucket on fluorine got
> enabled today, so let's pick that as an arbitrary marker and mark this
> resolved, even though we clearly need to do way more work on logging.

An alternative is to make this bug a tracking bug, but [[WP:OKAY]].
Comment 9 Gerrit Notification Bot 2014-04-07 19:35:17 UTC
Change 75560 abandoned by Ori.livneh:
Parse MediaWiki fatals/exceptions and republish as JSON stream

https://gerrit.wikimedia.org/r/75560
