Last modified: 2014-04-21 18:39:43 UTC
Icinga's replication checks are disabled for 6/7 analytics slaves. Let's get them turned on again, so Icinga alerts our team about lags again.

Icinga shows the following relevant services disabled:
* s1-analytics-slave.eqiad.wmnet (db1047.eqiad.wmnet)
** MySQL Replication Heartbeat
** MySQL Slave Delay
* s2-analytics-slave.eqiad.wmnet (db69.pmtpa.wmnet)
** MySQL Replication Heartbeat
** MySQL Slave Delay
** MySQL Slave Running
* s3-analytics-slave.eqiad.wmnet (db71.pmtpa.wmnet)
** MySQL Replication Heartbeat
** MySQL Slave Delay
* s4-analytics-slave.eqiad.wmnet (db72.pmtpa.wmnet)
** <none>
* s4-analytics-slave.eqiad.wmnet (db1017.eqiad.wmnet)
** MySQL Replication Heartbeat
** MySQL Slave Delay
* s6-analytics-slave.eqiad.wmnet (db74.pmtpa.wmnet)
** MySQL Replication Heartbeat
** MySQL Slave Delay
* s7-analytics-slave.eqiad.wmnet (db68.pmtpa.wmnet)
** MySQL Replication Heartbeat
** MySQL Slave Delay
It seems no one in our team knows why the alerts are disabled, so I pinged springle about it.
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1555
Discussion with springle showed that the Icinga alerts are turned off on purpose, as they go off too often (due to slow queries run by analytics :-) ). Since a separate machine for slow queries is already on the way, springle suggested waiting for this machine; once slow queries have been migrated over, we turn on Icinga alerts for the other machines again. Until then I'll keep an eye on the lag and send out alerts if it gets too high.
Thanks for catching this, Christian! -Toby