Last modified: 2014-08-21 13:07:23 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T71812, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 69812 - geowiki data aggregation failed on 2014-08-19
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: christian
Reported: 2014-08-20 20:47 UTC by christian
Modified: 2014-08-21 13:07 UTC


Description christian 2014-08-20 20:47:27 UTC
On 2014-08-19, a PoolWorker hit MySQL connection issues [1] during the process_data.py job, and that took the whole job down.

Since stat1003 had its distupgrade on the same day, maybe those two things
are related?




[1]
  File "/srv/geowiki/scripts/geowiki/process_data.py", line 388, in <module>
    main()
  File "/srv/geowiki/scripts/geowiki/process_data.py", line 379, in main
    run_parallel(opts)
  File "/srv/geowiki/scripts/geowiki/process_data.py", line 47, in run_parallel
    p.map(partial_process_project, opts['wp_projects'])
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 227, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
_mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')
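
The traceback above ends in pool.py's get() with "raise self._value", which is how one failing worker takes the whole map() call down: the exception raised in the worker is re-raised in the parent. A minimal sketch of that behaviour follows, written for Python 3 and using a thread pool instead of the script's process pool so that it stays self-contained. LostConnection, process_project, and safe_process_project are hypothetical stand-ins, and this run_parallel does not have the same signature as the one in process_data.py.

```python
from multiprocessing.pool import ThreadPool


class LostConnection(Exception):
    """Hypothetical stand-in for _mysql_exceptions.OperationalError."""


def process_project(project):
    # Hypothetical per-project worker; "broken" simulates a dropped connection.
    if project == "broken":
        raise LostConnection(2013, "Lost connection to MySQL server during query")
    return project.upper()


def safe_process_project(project):
    # Variant that reports the error instead of raising it, so the remaining
    # projects still produce results.
    try:
        return (project, process_project(project), None)
    except LostConnection as exc:
        return (project, None, exc)


def run_parallel(worker, projects, pool_size=2):
    # map() gathers results through get(), which re-raises a worker's exception
    # in the parent, so the caller gets no results for any project.
    with ThreadPool(pool_size) as pool:
        return pool.map(worker, projects)
```

With process_project as the worker, a single "broken" project makes run_parallel raise and return nothing. With safe_process_project, the caller gets one (project, result, error) tuple per project and can decide what to do with the failed ones.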
Comment 1 christian 2014-08-20 22:44:44 UTC
(In reply to christian from comment #0)
> Since stat1003 had its distupgrade on the same day, maybe those two things
> related?

The stat1003 distupgrade looks unrelated.
The failure happened before the distupgrade started.
So it's merely a coincidence.

> _mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server
> during query')

The relevant connection was to dbstore1002, and tendril shows >40
aborted connections for dbstore1002 around that time and over the following
few hours. There was also an Icinga alert for dbstore1002 a bit later due to
a socket timeout. So the issue might have been on dbstore1002's side.

Today's run passed without problems and also produced the data for
yesterday's run. So it seems to have been a fluke around dbstore1002.

If tomorrow's run passes too, I'll close the bug.
Comment 2 christian 2014-08-21 13:07:23 UTC
Today's run again passed without issues.

I discussed the issue with springle: there was a backup job a bit
before the geowiki run. That backup job spiked a few graphs, but springle said
that there is no immediately obvious reason why it would affect the
client connections.

The docs at
  https://dev.mysql.com/doc/refman/5.5/en/error-lost-connection.html
also point a bit at networking issues.

Due to lack of evidence, and since this is the first time we have seen this
error in this form, and since the two most recent runs passed
without issues, I'll write it off as a fluke for now.
