Last modified: 2014-01-22 17:53:39 UTC

To report PHP errors, or database errors such as the following:

----
Database error

A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

(SQL query hidden)

from within function "LinksUpdate::incrTableUpdate". Database returned error "1205: Lock wait timeout exceeded; try restarting transaction (10.0.6.41)".
-----

An IRC bot could be written that reports DB errors, and perhaps PHP/MediaWiki errors or other kinds of "should be rare" errors as well.

According to Reedy there's a "global db/sql error" file on fenari.

Since these should be "rare" is there really all that much need for it?

(In reply to comment #1)
> Since these should be "rare" is there really all that much need for it?

They *should* be rare, but that doesn't mean that they are ;)

Interesting files on fenari are:
/home/wikipedia/syslog/apache.log -- Aggregated error.log for all Apaches. Needs some filtering to be usable; grep -i fatal works well for me
/home/wikipedia/log/dberror.log -- DB errors

Some additional notes:
* these files have different formats
* repeated errors have to be filtered out so this doesn't get too noisy
* srv numbers should be reported, as well as db numbers for DB errors


(In reply to comment #2)
> (In reply to comment #1)
> > Since these should be "rare" is there really all that much need for it?
> They *should* be rare, but that doesn't mean that they are ;)
>
> Interesting files on fenari are:
> /home/wikipedia/syslog/apache.log -- Aggregated error.log for all Apaches.
> Needs some filtering to be usable; grep -i fatal works well for me
> /home/wikipedia/log/dberror.log -- DB errors
>
> Some additional notes:
> * these files have different formats
> * repeated errors have to be filtered out so this doesn't get too noisy
> * srv numbers should be reported, as well as db numbers for DB errors

Could you publish a sample of each somewhere?

I can later if Roan doesn't get round to it before me

Censored sample of dberror.log:

Tue Apr 12 6:35:04 UTC 2011 srv254 enwiki Error connecting to 10.0.6.22: Can't connect to MySQL server on '10.0.6.22' (4) (10.0.6.22)
Tue Apr 12 6:35:15 UTC 2011 srv265 dewiki User::invalidateCache 10.0.6.33 1205 Lock wait timeout exceeded; Try restarting transaction (10.0.6.33) UPDATE `user` SET user_touched = 'CENSORED' WHERE user_id = 'CENSORED'
Tue Apr 12 6:51:14 UTC 2011 srv278 ruwiki GlobalUsage::insertLinks 10.0.6.41 1062 Duplicate entry 'Houston_City_Hall_from_Hermann_Square_(HDR).jpg-ruwiki-4401' for key 'PRIMARY' (10.0.6.41) INSERT INTO `globalimagelinks` (gil_wiki,gil_page,gil_page_namespace_id,gil_page_namespace,gil_page_title,gil_to) VALUES ('ruwiki','4401','0','','Заглавная_страница','Houston_City_Hall_from_Hermann_Square_(HDR).jpg')
Tue Apr 12 6:52:33 UTC 2011 srv163 frwiki Job::pop 10.0.6.39 1213 Deadlock found when trying to get lock; Try restarting transaction (10.0.6.39) DELETE FROM `job` WHERE job_id = '136559964'

Censored sample of grep -i fatal apache.log:

Apr 12 07:13:18 10.0.8.3 apache2[6295]: PHP Fatal error: Maximum execution time of CENSORED seconds exceeded in /usr/local/apache/common-local/php-1.17/includes/parser/Parser.php on line 3202
Apr 12 07:19:48 10.0.8.2 apache2[3887]: PHP Fatal error: Allowed memory size of CENSORED bytes exhausted (tried to allocate CENSORED bytes) in /usr/local/apache/common-local/php-1.17/includes/parser/LinkHolderArray.php on line 265
Apr 12 08:48:56 10.0.8.18 apache2[14920]: PHP Fatal error: Call to a member function isRedirect() on a non-object in /usr/local/apache/common-local/php-1.17/extensions/Collection/Collection.php on line 369

As you can see, the DB servers in dberror.log and the srv servers in apache.log are stored as IP addresses, so you'd need to resolve those:

$ host 10.0.8.18
18.8.0.10.in-addr.arpa domain name pointer srv268.pmtpa.wmnet.

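
To make the sample above concrete, here is a minimal Python sketch of how a bot might split a dberror.log entry into fields and resolve the DB server's IP through reverse DNS. The field layout is only inferred from the censored sample (the real file may use different separators), and the names `parse_dberror_line` and `reverse_dns` are hypothetical, not existing tools on fenari.

```python
import re
import socket

IPV4 = r"\d{1,3}(?:\.\d{1,3}){3}"

# Assumed layout, inferred only from the sample in this thread:
#   <timestamp> <srv host> <wiki> <rest of entry>
ENTRY_RE = re.compile(
    r"^(?P<timestamp>\w{3} \w{3} +\d{1,2} +\d{1,2}:\d{2}:\d{2} UTC \d{4})\s+"
    r"(?P<host>\S+)\s+(?P<wiki>\S+)\s+(?P<rest>.*)$"
)
# Query errors continue with: <function> <db ip> <errno> <message and SQL>
QUERY_ERROR_RE = re.compile(
    r"^(?P<function>\S+)\s+(?P<db_ip>" + IPV4 + r")\s+(?P<errno>\d+)\s+(?P<message>.*)$"
)


def reverse_dns(ip):
    """Return the PTR name for ip, or ip itself if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def parse_dberror_line(line, resolve=reverse_dns):
    """Parse one dberror.log entry into a dict, or return None if it does not match."""
    entry = ENTRY_RE.match(line.strip())
    if entry is None:
        return None
    result = {
        "timestamp": entry["timestamp"],
        "host": entry["host"],
        "wiki": entry["wiki"],
        "function": None,
        "errno": None,
        "message": entry["rest"],
        "db_host": None,
    }
    query_error = QUERY_ERROR_RE.match(entry["rest"])
    if query_error:
        result["function"] = query_error["function"]
        result["errno"] = int(query_error["errno"])
        result["message"] = query_error["message"]
        result["db_host"] = resolve(query_error["db_ip"])
    else:
        # Connection errors only mention the DB server inside the message.
        ip = re.search(IPV4, entry["rest"])
        if ip:
            result["db_host"] = resolve(ip.group(0))
    return result
```

The `resolve` parameter is injectable so the parser can be exercised without network access; on a host inside the cluster the default would perform the same lookup as the `host` command shown above.
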
I'm not sure the censoring of the limits in apache.log was necessary, but I do think we'll want to censor SQL queries before posting them to a public channel. There is code for this in MW already (censoring and generalizing SQL queries for profiling purposes), somewhere.
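
As an illustration of what censoring and generalizing a query could look like, here is a rough Python sketch that replaces string and numeric literals with placeholders. It is not the MediaWiki profiling code mentioned above; the function name `generalize_sql` and the placeholder tokens `'X'` and `N` are assumptions made for this example.

```python
import re

# Single- or double-quoted literals, allowing backslash escapes inside.
_STRING_RE = re.compile(r"'(?:[^'\\]|\\.)*'" + "|" + r'"(?:[^"\\]|\\.)*"')
# Bare numbers that are not part of an identifier such as `table2`.
_NUMBER_RE = re.compile(r"(?<![\w`])-?\d+(?:\.\d+)?(?![\w`])")


def generalize_sql(query):
    """Strip literal values from a query so it can be shown publicly and grouped."""
    query = _STRING_RE.sub("'X'", query)
    query = _NUMBER_RE.sub("N", query)
    # Collapse whitespace so equivalent queries compare equal.
    return re.sub(r"\s+", " ", query).strip()
```

With this, `DELETE FROM `job` WHERE job_id = '136559964'` would become `DELETE FROM `job` WHERE job_id = 'X'`. A side effect is that queries differing only in their values collapse into the same string, which also helps with filtering repeated errors.
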

(In reply to comment #5)
> Censored sample of dberror.log:
>
> Tue Apr 12 6:35:04 UTC 2011 srv254 enwiki Error connecting to 10.0.6.22:
> Can't connect to MySQL server on '10.0.6.22' (4) (10.0.6.22)

Looks good.

> Censored sample of grep -i fatal apache.log :
> Apr 12 07:13:18 10.0.8.3 apache2[6295]: PHP Fatal error: Maximum execution
> time of CENSORED seconds exceeded i

Why only fatals though ? I think we should keep our code conventions to trunk to wmf as well, no notices, warnings or fatals should appear. Although since we're just getting started on this, it makes sense to start with a filtered output to existing channels, but an unfiltered output could be set as well. ie. #wikimedia-debug or whatever.

(unfiltered, not uncensored)

> I'm not sure the censoring of the limits in apache.log was necessary, but I do
> think we'll want to censor SQL queries before posting them to a public channel.
> There is code for this in MW already (censoring and generalizing SQL queries
> for profiling purposes), somewhere.

Nice!

> As you can see the DB servers in dberror.log and the srv servers in apache.log
> are stored as IP addresses, so you'd need to resolve those:

Is this information available on noc.wikimedia.org as well ? There are some IPs and names relations there, not sure if these can or should be there as well.

In mc.php there's 46 => '10.0.8.18:11000',


(In reply to comment #6)
> Why only fatals though ?

Because there's lots of garbage like this polluting the logs all the time:

Apr 12 06:31:22 10.0.8.21 apache2[29754]: [error] [client 208.80.152.81] Symbolic link not allowed or link target not accessible: /usr/local/apache/common/docroot/meta/style, referer: http://cursilloswfla.org/
Apr 12 06:31:22 10.0.8.21 apache2[29880]: [error] [client 208.80.152.71] (36)File name too long: access to /Category:Banks_of_S%252525252[...]
Apr 12 06:31:23 10.0.2.233 apache2[23144]: [error] [client 208.80.152.87] Directory index forbidden by Options directive: /usr/local/apache/common/docroot/commons/w/

> I think we should keep our code conventions to trunk
> to wmf as well, no notices, warnings or fatals should appear. Although since
> we're just getting started on this, it makes sense to start with a filtered
> output to existing channels, but an unfiltered output could be set as well. ie.
> #wikimedia-debug or whatever.
>
> (unfiltered, not uncensored)

Reporting notices and warnings is fine, as long as the garbage mentioned above is filtered out.

> > As you can see the DB servers in dberror.log and the srv servers in apache.log
> > are stored as IP addresses, so you'd need to resolve those:
>
> Is this information available on noc.wikimedia.org as well ? There are some IPs
> and names relations there, not sure if these can or should be there as well.
>
> In mc.php there's 46 => '10.0.8.18:11000',

No, this info is not available there, but this script would have to run on fenari or another host within the cluster anyway, so it can just do reverse DNS lookups.

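
The filtering discussed here could be sketched as follows: keep only lines that PHP itself logged (fatals, warnings, notices) and drop Apache's own `[error]` chatter. This is a minimal Python sketch that assumes the syslog layout seen in the samples in this thread; `parse_php_error` is a hypothetical helper name, and the list of PHP severity labels is an assumption about what appears in apache.log.

```python
import re

# Assumed shape: "<Mon DD HH:MM:SS> <ip> apache2[<pid>]: PHP <level>: <message> in <file> on line <n>"
PHP_ERROR_RE = re.compile(
    r"^(?P<timestamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<ip>\S+) apache2\[(?P<pid>\d+)\]: "
    r"PHP (?P<level>Fatal error|Parse error|Warning|Notice): +"
    r"(?P<message>.*?)(?: in (?P<file>\S+) on line (?P<line>\d+))?$"
)


def parse_php_error(line):
    """Return a dict for PHP-generated log lines, or None for everything else."""
    match = PHP_ERROR_RE.match(line.strip())
    if match is None:
        # Apache's own "[error] [client ...]" lines end up here and are dropped.
        return None
    fields = match.groupdict()
    if fields["line"] is not None:
        fields["line"] = int(fields["line"])
    return fields
```

Matching on the `PHP <level>:` prefix rather than grepping for "fatal" lets notices and warnings through while still discarding lines such as "File name too long" or "Directory index forbidden".
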

(In reply to comment #7)
> (In reply to comment #6)
> > Why only fatals though ?
> Because there's lots of garbage like this polluting the logs all the time:
>
> Apr 12 06:31:22 10.0.8.21 apache2[29754]: [error] [client 208.80.152.81]
> Symbolic link not allowed or link target not accessible:
> /usr/local/apache/common/docroot/meta/style, referer: http://cursilloswfla.org/
> Apr 12 06:31:22 10.0.8.21 apache2[29880]: [error] [client 208.80.152.71]
> (36)File name too long: access to
> /Category:Banks_of_S%252525252

> > output to existing channels, but an unfiltered output could be set as well. ie.
> > #wikimedia-debug or whatever.
> >
> > (unfiltered, not uncensored)
> >
> Reporting notices and warnings is fine, as long as the garbage mentioned above
> is filtered out.

Yeah, I forgot apache logs aren't just php's errors.


So, we've got:
* Aggregated logs on the servers: https://wikitech.wikimedia.org/wiki/Logs
* Ganglia and Graphite graphing some of these as numerical statistics, but no actual errors or trends. One needs to open the logs for details.

That's fine when working on a major exception spike (regression), but when trying to find minor notices and warnings not affecting everyone we need something else.

translatewiki.net has an IRC bot echoing all these error logs; that's too much for us (at the very least we'd need to de-duplicate things). However, I think it should be feasible to develop something that monitors these, detects similar errors (similar to how we group them in fatalmonitor), and only reports to IRC when new errors are first seen or errors seen earlier become significantly more common.

We need to be careful about what is exposed, but all in all a nice web dashboard to show the details and an IRC bot to report trends and new errors could be quite useful.

The web dashboard should probably not be written from scratch (perhaps use Logstash); if it also has an API to query trends and new errors, we can write an IRC reporter off of that.

This would either need to be run in production (proxied through fenari or whatever we do for things like graphite/gdash these days), or we'd need to replicate the necessary data to a wmflabs instance.

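
The "report only new errors or significant increases" idea could be prototyped roughly like this. It is a Python sketch under stated assumptions: it does not reproduce fatalmonitor's grouping, and the names `signature` and `ErrorTrendTracker` as well as the default thresholds (`spike_factor=3.0`, `min_count=5`) are invented for illustration.

```python
import re
from collections import Counter


def signature(message):
    """Group similar errors by replacing every run of digits with N (a crude heuristic)."""
    return re.sub(r"\d+", "N", message)


class ErrorTrendTracker:
    """Counts error signatures per time window and flags new or spiking ones."""

    def __init__(self, spike_factor=3.0, min_count=5):
        self.spike_factor = spike_factor
        self.min_count = min_count
        self.previous = Counter()  # counts from the last completed window
        self.current = Counter()   # counts for the window in progress
        self.seen = set()          # every signature observed so far

    def record(self, sig):
        """Count one occurrence; return True only the first time sig is ever seen."""
        self.current[sig] += 1
        if sig in self.seen:
            return False
        self.seen.add(sig)
        return True

    def close_window(self):
        """End the current window; return sorted (sig, old_count, new_count) spikes."""
        spikes = []
        for sig, count in self.current.items():
            old = self.previous[sig]  # Counter yields 0 for unknown keys
            if old > 0 and count >= self.min_count and count >= old * self.spike_factor:
                spikes.append((sig, old, count))
        self.previous = self.current
        self.current = Counter()
        return sorted(spikes)
```

A reporter would announce a signature on IRC when `record()` returns True, call `close_window()` on a timer (for example once per hour), and announce whatever spikes it returns. Everything else stays in the logs or on a dashboard.
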