Last modified: 2014-08-21 13:07:23 UTC
On 2014-08-19 a PoolWorker had MySQL connection issues [1] during the process_data.py job, and that took the whole job down. Since stat1003 had its distupgrade on the same day, maybe those two things are related?

[1]
  File "/srv/geowiki/scripts/geowiki/process_data.py", line 388, in <module>
    main()
  File "/srv/geowiki/scripts/geowiki/process_data.py", line 379, in main
    run_parallel(opts)
  File "/srv/geowiki/scripts/geowiki/process_data.py", line 47, in run_parallel
    p.map(partial_process_project, opts['wp_projects'])
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 227, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
_mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')
(In reply to christian from comment #0)
> Since stat1003 had its distupgrade on the same day, maybe those two things
> related?

The stat1003 distupgrade looks unrelated: the failure happened before the distupgrade started, so it's merely a coincidence.

> _mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server
> during query')

The relevant connection was to dbstore1002, and tendril shows >40 aborted connections for dbstore1002 around that time and during the following few hours. There was also an Icinga alert for dbstore1002 a bit later due to a socket timeout. So the issue might have been on dbstore1002's side.

Today's run passed without problems and also produced the data for yesterday's failed run. So it seems to have been a fluke around dbstore1002. If tomorrow's run passes too, I'll close the bug.
Today's run again passed without issues. Discussing the issue with springle: there was a backup job a bit before the geowiki run. That job spiked a few graphs, but springle said there is no immediately obvious reason why it would affect client connections. The docs at https://dev.mysql.com/doc/refman/5.5/en/error-lost-connection.html also point somewhat at networking issues. Due to the lack of evidence, since this is the first time we have seen this error in that form, and since the two most recent runs passed without issues, I'll write it off as a fluke for now.