Last modified: 2012-02-02 14:14:29 UTC
On pl.wikipedia.org from 05:58:45 onwards, SiteStatsInit::refresh() began to be called several times per second. It's not known at this stage why SiteStats::isSane() returned false. The binlog shows that the refresh() queries were often executed in autocommit mode, meaning that the DELETE query was committed before the INSERT query began. This would have caused isSane() to return false until the new row insert was committed, leading to a flood of attempted refreshes. Eventually, a flood of SELECT COUNT(*) queries at around 07:10 caused an overload on all s2 slaves, leading to an overload of the apache pool and site-wide downtime. SiteStatsInit was disabled and all related queries were killed. When the dust settled, the site_stats row was missing, and had to be recovered from binlogs. I suggest removing the isSane() checks from loadAndLazyInit(), and doing a refresh only from maintenance scripts or web-based upgrade. SiteStats::load() should be able to tolerate a missing site_stats row, and the accessor functions should return false without giving a PHP warning. Additionally, the refresh should be done with REPLACE instead of DELETE and INSERT.