Last modified: 2014-05-28 13:29:51 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T66154, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 64154 - Replication for enwiki has stopped


Summary:	Replication for enwiki has stopped

Status:	RESOLVED FIXED

Product:	Wikimedia Labs
Classification:	Unclassified
Component:	tools
Version:	unspecified
Hardware:	All All

Importance:	High normal
Target Milestone:	---
Assigned To:	Sean Pringle

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	labs-replication

Reported:	2014-04-20 14:01 UTC by Tim Landscheidt
Modified:	2014-05-28 13:29 UTC
CC List:	7 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Tim Landscheidt 2014-04-20 14:01:20 UTC

Replication for enwiki stopped about two days ago:

| MariaDB [enwiki_p]> SELECT MAX(rc_timestamp) FROM recentchanges;
| +-------------------+
| | MAX(rc_timestamp) |
| +-------------------+
| | 20140418081351    |
| +-------------------+
| 1 row in set (0.01 sec)

| MariaDB [enwiki_p]>

Coren wrote in http://permalink.gmane.org/gmane.org.wikimedia.labs/2336:

| > Taking a look, enwiki_p is at 1 day, 8:06:02 lag. I think it's probably
| > due to someone having a broken request.

| > I know Coren will end up killing it, but it would be useful to know who
| > is causing these issues.

| Not this time; there were some system control statements issued in prod
| that cannot work on the replicas that have stalled the replication
| timeline.  This will need a bit of tender loving care from our DBA.
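The age of the newest recentchanges row, as returned by the query above, can be turned into a rough lag figure. The following is a minimal sketch, assuming rc_timestamp is in MediaWiki's YYYYMMDDHHMMSS format in UTC; the function name replication_lag is made up for illustration, and the result is only a proxy for lag, since it really measures the time since the last replicated change.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(rc_timestamp: str, now: datetime) -> timedelta:
    """Return how far the newest replicated change lies behind `now`.

    `rc_timestamp` is assumed to be a MediaWiki-style YYYYMMDDHHMMSS
    string in UTC, e.g. the result of SELECT MAX(rc_timestamp).
    """
    latest = datetime.strptime(rc_timestamp, "%Y%m%d%H%M%S")
    latest = latest.replace(tzinfo=timezone.utc)
    return now - latest


# Values taken from this report: newest row vs. the time the bug was filed.
reported = datetime(2014, 4, 20, 14, 1, 20, tzinfo=timezone.utc)
lag = replication_lag("20140418081351", reported)
print(lag)  # 2 days, 5:47:29
```

On a wiki as busy as enwiki, new rows arrive every few seconds, so a value of hours or days is a strong hint that replication has stalled rather than that nobody edited.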

Comment 1 Sean Pringle 2014-04-21 03:33:31 UTC

Replication on labsdb1001 stopped on a DROP USER statement for a user that exists upstream but not locally. The statement has been skipped and replication is catching up.

Two related issues:

1. labsdb* replication is not using --replicate-wild-ignore-table=mysql.% and probably should.

2. The /usr/lib/nagios/percona/check_mysql_slave_running script is broken on labsdb* because it's passed a mysql socket argument that is ignored, making the connection fail (and for some reason that outcome doesn't count as critical...wtf)
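For the second issue, the behaviour one would want from a replication check can be sketched as follows. This is not the Percona script; it is a hypothetical check_slave function that takes the row from SHOW SLAVE STATUS as a dict (or None when the connection failed) and maps it to the standard Nagios exit codes. The column names are those of SHOW SLAVE STATUS; the max_lag threshold is an arbitrary assumption.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_slave(status, max_lag=3600):
    """Map a SHOW SLAVE STATUS row to (exit_code, message).

    `status` is a dict keyed by column name, or None if the check
    could not connect. `max_lag` (seconds) is an assumed threshold.
    """
    if status is None:
        # A failed connection must never look healthy.
        return CRITICAL, "could not connect to the replica"
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    if not (io_ok and sql_ok):
        error = status.get("Last_SQL_Error") or "no error reported"
        return CRITICAL, "replication thread stopped: " + error
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        return UNKNOWN, "replica reports no lag value"
    if int(lag) > max_lag:
        return WARNING, "replica is %d seconds behind" % int(lag)
    return OK, "replication running, %d seconds behind" % int(lag)


code, message = check_slave(None)
print(code, message)
```

The point of the sketch is the first branch: a connection failure is reported as critical instead of falling through to a non-critical result.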

Comment 2 Tim Landscheidt 2014-05-06 00:57:31 UTC

Replication for enwiki seems to have stopped again:

| MariaDB [enwiki_p]> SELECT MAX(rc_timestamp) FROM recentchanges;
| +-------------------+
| | MAX(rc_timestamp) |
| +-------------------+
| | 20140505180410    |
| +-------------------+
| 1 row in set (0.00 sec)

| MariaDB [enwiki_p]>

Is this related or a different issue?

Comment 3 Sean Pringle 2014-05-06 01:22:29 UTC

This blocked replication:

---TRANSACTION D3668010, ACTIVE 28563 sec fetching rows
mysql tables in use 3, locked 3
132316 lock struct(s), heap size 13384120, 1331972 row lock(s), undo log entries 29621
MySQL thread id 61759507, OS thread handle 0x7f698bf66700, query id 1194334863 10.68.1
DELETE FROM temp WHERE pid IN ( SELECT /* SLOW_OK LIMIT:2000 NM */ /* CATSCAN2 */ DIST
...
TOO MANY LOCKS PRINTED FOR THIS TRX: SUPPRESSING FURTHER PRINTS

Coren killed it.
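Transactions like the one above can be spotted by scanning InnoDB status output for transaction headers with a large ACTIVE value. A minimal sketch, assuming the header format shown in this report; the function name long_transactions and the one-hour threshold are made up for illustration.

```python
import re
from datetime import timedelta

# Matches headers such as:
# ---TRANSACTION D3668010, ACTIVE 28563 sec fetching rows
TRX_RE = re.compile(r"^---TRANSACTION (?P<id>[0-9A-F]+), ACTIVE (?P<secs>\d+) sec")


def long_transactions(innodb_status, threshold=3600):
    """Return (transaction id, age) for transactions active >= threshold seconds."""
    found = []
    for line in innodb_status.splitlines():
        match = TRX_RE.match(line)
        if match and int(match.group("secs")) >= threshold:
            age = timedelta(seconds=int(match.group("secs")))
            found.append((match.group("id"), age))
    return found


# First two lines of the output quoted in this comment.
sample = (
    "---TRANSACTION D3668010, ACTIVE 28563 sec fetching rows\n"
    "mysql tables in use 3, locked 3\n"
)
for trx_id, age in long_transactions(sample):
    print(trx_id, age)  # D3668010 7:56:03
```

The transaction in this comment had been active for almost eight hours when it was found.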
