Last modified: 2014-10-07 00:24:48 UTC
From 2014-10-05's SAL [1]: 20:08 Nemo_bis: 22.03 < Ainali> It was just noticed on svwp village pump that http://stats.wikimedia.org is down. I checked, and apache is currently not running on stat1001 (although it should be). Hence, none of its configured sites are available. This includes stats.wikimedia.org and datasets.wikimedia.org. stat1001's dmesg showed 6 messages about limn-reportcard respawning too fast every 20 minutes (puppet run?) until 2014-10-04 17:45. It might be that things broke around that time. Icinga shows CRITICAL for the "puppet last run" service. (But the service is currently muted. Does anyone know why?) [1] https://wikitech.wikimedia.org/wiki/Server_Admin_Log
This ticket needs Ops power. I filed RT #8554 for it.
https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Miscellaneous+eqiad&h=stat1001.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS shows that at some point 1 GB of memory was freed and then traffic dropped. Hoo concluded that apache2 died and that puppet doesn't configure the machine to restart it.
Change 164914 had a related patch set uploaded by QChris: End stats.wikimedia.org certificate in newline https://gerrit.wikimedia.org/r/164914
Change 164914 merged by Filippo Giunchedi: End stats.wikimedia.org certificate in newline https://gerrit.wikimedia.org/r/164914
godog restarted apache on stat1001. https://stats.wikimedia.org/ and https://datasets.wikimedia.org/ are working again. It seems certificate chaining choked on stats.wikimedia.org's certificate not ending in a newline. A stop-gap fix is in comment #3. But godog and _joe_ said this should be caught by the certificate chaining itself, which makes sense. The RT ticket has been updated accordingly. Thanks godog and _joe_!
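To illustrate why a missing trailing newline can break chaining, here is a minimal sketch. It assumes the chain is built by plain concatenation of PEM files; the thread does not say how chaining is actually implemented on stat1001. The names chain_pem and has_fused_boundary are hypothetical and used only for this example.

```python
# Marker that appears when one PEM block runs straight into the next
# because the first file did not end in a newline.
FUSED = "-----END CERTIFICATE----------BEGIN CERTIFICATE-----"


def has_fused_boundary(chained: str) -> bool:
    """Return True if two certificates were joined without a line break."""
    return FUSED in chained


def chain_pem(parts: list[str]) -> str:
    """Concatenate PEM blobs, adding a trailing newline where one is missing."""
    out = []
    for pem in parts:
        if not pem.endswith("\n"):
            pem += "\n"
        out.append(pem)
    return "".join(out)


# Placeholder certificate bodies, not real certificates.
cert_a = "-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----"  # no newline
cert_b = "-----BEGIN CERTIFICATE-----\nBBBB\n-----END CERTIFICATE-----\n"

naive_chain = cert_a + cert_b              # plain concatenation
safe_chain = chain_pem([cert_a, cert_b])   # newline-guarded concatenation
```

With plain concatenation the END and BEGIN markers end up on the same line, which PEM parsers generally reject; guarding the join in the chaining step would make the newline in each source certificate optional.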
Created attachment 16683: Page request that match "undefined" from the october sample logs so far.
Please ignore prior assignment.