Last modified: 2014-05-28 13:29:51 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T66154, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 64154 - Replication for enwiki has stopped
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Sean Pringle
Depends on:
Blocks: labs-replication
Reported: 2014-04-20 14:01 UTC by Tim Landscheidt
Modified: 2014-05-28 13:29 UTC (History)

Description Tim Landscheidt 2014-04-20 14:01:20 UTC
Replication for enwiki stopped about two days ago:

| MariaDB [enwiki_p]> SELECT MAX(rc_timestamp) FROM recentchanges;
| +-------------------+
| | MAX(rc_timestamp) |
| +-------------------+
| | 20140418081351    |
| +-------------------+
| 1 row in set (0.01 sec)

| MariaDB [enwiki_p]>
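
The lag implied by the query above can be worked out from the MediaWiki-style timestamp (YYYYMMDDHHMMSS, UTC). A minimal sketch, assuming the time of this report is used as "now"; the function name `replication_lag` is hypothetical and not part of any Labs tooling:

```python
from datetime import datetime, timezone


def replication_lag(rc_timestamp: str, now: datetime):
    """Return `now` minus the newest replicated change as a timedelta.

    rc_timestamp is a 14-digit MediaWiki timestamp, interpreted as UTC.
    """
    newest = datetime.strptime(rc_timestamp, "%Y%m%d%H%M%S")
    newest = newest.replace(tzinfo=timezone.utc)
    return now - newest


# Assumption: "now" is the time the bug was reported (2014-04-20 14:01:20 UTC).
reported = datetime(2014, 4, 20, 14, 1, 20, tzinfo=timezone.utc)
lag = replication_lag("20140418081351", reported)
print(lag)  # 2 days, 5:47:29
```

This only measures the age of the newest row in recentchanges, so on a quiet wiki it can overstate the real replication lag.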

Coren wrote in http://permalink.gmane.org/gmane.org.wikimedia.labs/2336:

| > Taking a look, enwiki_p is at 1 day, 8:06:02 lag. I think it's probably
| > due to someone having a broken request.

| > I know Coren will end up killing it, but it would be useful to know who
| > is causing these issues.

| Not this time; there were some system control statements issued in prod
| that cannot work on the replicas that have stalled the replication
| timeline.  This will need a bit of tender loving care from our DBA.
Comment 1 Sean Pringle 2014-04-21 03:33:31 UTC
labsdb1001 was stopped on a DROP USER statement where the upstream user did not exist locally. The statement has been skipped and replication is catching up.

Two related issues:

1. labsdb* replication is not using --replicate-wild-ignore-table=mysql.% and probably should.

2. The /usr/lib/nagios/percona/check_mysql_slave_running script is broken on labsdb* because it's passed a mysql socket argument that is ignored, making the connection fail (and for some reason that outcome doesn't count as critical...wtf)
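
For the second issue, the behaviour one would probably expect from such a check is that a failed connection is itself reported as critical. The sketch below is not the Percona script; `slave_check` is a hypothetical helper that takes one row of SHOW SLAVE STATUS as a dict, or None when the connection failed, and returns a Nagios exit code and message:

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def slave_check(status):
    """Map a SHOW SLAVE STATUS row (dict) or None to (exit_code, message).

    None stands for "could not connect", which is treated as CRITICAL
    rather than being silently ignored.
    """
    if status is None:
        return CRITICAL, "CRITICAL: could not connect to the database"
    io = status.get("Slave_IO_Running")
    sql = status.get("Slave_SQL_Running")
    if io == "Yes" and sql == "Yes":
        return OK, "OK: replication threads running"
    return CRITICAL, f"CRITICAL: Slave_IO_Running={io}, Slave_SQL_Running={sql}"


# A stopped SQL thread, as after the failed DROP USER statement.
print(slave_check({"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No"}))
# A failed connection, as with the ignored socket argument.
print(slave_check(None))
```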
Comment 2 Tim Landscheidt 2014-05-06 00:57:31 UTC
Replication for enwiki seems to have stopped again:

| MariaDB [enwiki_p]> SELECT MAX(rc_timestamp) FROM recentchanges;
| +-------------------+
| | MAX(rc_timestamp) |
| +-------------------+
| | 20140505180410    |
| +-------------------+
| 1 row in set (0.00 sec)

| MariaDB [enwiki_p]>

Is this related or a different issue?
Comment 3 Sean Pringle 2014-05-06 01:22:29 UTC
This blocked replication:

---TRANSACTION D3668010, ACTIVE 28563 sec fetching rows
mysql tables in use 3, locked 3
132316 lock struct(s), heap size 13384120, 1331972 row lock(s), undo log entries 29621
MySQL thread id 61759507, OS thread handle 0x7f698bf66700, query id 1194334863 10.68.1
DELETE FROM temp WHERE pid IN ( SELECT /* SLOW_OK LIMIT:2000 NM */ /* CATSCAN2 */ DIST
...
TOO MANY LOCKS PRINTED FOR THIS TRX: SUPPRESSING FURTHER PRINTS

Coren killed it.
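
Transactions like this one can be spotted from the "ACTIVE n sec" header that InnoDB prints for each transaction. A rough sketch, assuming the text of SHOW ENGINE INNODB STATUS is already available as a string; `long_transactions` and the one-hour threshold are hypothetical, and the second sample line is invented for illustration:

```python
import re

# Matches headers such as "---TRANSACTION D3668010, ACTIVE 28563 sec fetching rows".
TRX_HEADER = re.compile(r"^---TRANSACTION (\S+), ACTIVE (\d+) sec")


def long_transactions(innodb_status: str, threshold_sec: int = 3600):
    """Return (transaction id, active seconds) for transactions at or over the threshold."""
    found = []
    for line in innodb_status.splitlines():
        match = TRX_HEADER.match(line)
        if match and int(match.group(2)) >= threshold_sec:
            found.append((match.group(1), int(match.group(2))))
    return found


sample = (
    "---TRANSACTION D3668010, ACTIVE 28563 sec fetching rows\n"
    "---TRANSACTION D3668FFF, ACTIVE 2 sec\n"  # invented short transaction
)
print(long_transactions(sample))  # [('D3668010', 28563)]
```

At 28563 seconds, the transaction above had been holding its row locks for just under eight hours.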
