Last modified: 2014-04-28 15:56:22 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T58882, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 56882 - memcached-serious log flooded with TIMED RETRY errors


Summary:	memcached-serious log flooded with TIMED RETRY errors

Status:	RESOLVED FIXED

Product:	Wikimedia
Classification:	Unclassified
Component:	General/Unknown (Other open bugs)
Version:	wmf-deployment
Hardware:	All All

Importance:	High normal (vote)
Target Milestone:	---
Assigned To:	Tim Starling

URL:
Whiteboard:
Keywords:	platformeng

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2013-11-11 05:28 UTC by Ori Livneh
Modified:	2014-04-28 15:56 UTC (History)
CC List:	5 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Ori Livneh 2013-11-11 05:28:01 UTC

memcached-serious.log on fluorine receives about half a million of these every hour:

2013-11-11 05:23:25 mw36 enwiki: Memcached error for key "enwiki:resourceloader:filter:minify-css:7:5fa71091f95043d13fcefbea197f4233" on server "127.0.0.1:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY

It has been like that for several months now. It could be <https://bugs.php.net/bug.php?id=60049>: "Calling get with an invalid key and after that getMulti with a valid key does not work". This would explain why the keys that we're seeing in the log look sane; they are not the ones that trigger the failure. Could also be a twemproxy thing; it would be good to correlate the rate of errors with twemproxy deployment.

Comment 1 Ori Livneh 2013-11-11 06:00:25 UTC

Possibly related to <https://bugs.launchpad.net/libmemcached/+bug/928696>

Comment 2 Ori Livneh 2013-11-11 06:40:13 UTC

The keys "enwiki:newtalk:ip:10.0.0.14" and "enwiki:newtalk:ip:10.0.0.13" are preponderant in the logs and odd to behold. Possibly a different bug that is getting exposed by this one.

Comment 3 Tim Starling 2013-11-13 03:38:17 UTC

(In reply to comment #2)
> The keys "enwiki:newtalk:ip:10.0.0.14" and "enwiki:newtalk:ip:10.0.0.13" are
> preponderant in the logs and odd to behold. Possibly a different bug that is
> getting exposed by this one.

Those are LVS servers. Pybal is regularly requesting http://en.wikipedia.org/wiki/Main_Page on all backends, which is essentially the only MW traffic to the pmtpa apaches.

Comment 4 Tim Starling 2013-11-13 06:59:53 UTC

The error message indicates that a libmemcached memcached_connect() call gave a result of MEMCACHED_SERVER_TEMPORARILY_DISABLED. This can happen if memcached_mark_server_for_timeout() was called on the server, which can happen if memcached_quit_server is called with io_death=true, which can happen in all sorts of different cases in io.cc and in one case in response.cc. More investigation is needed to determine exactly which case is causing it. 

Unfortunately, MEMCACHED_BEHAVIOR_RETRY_TIMEOUT (i.e. retry_timeout in our config) has a minimum of one whole second. In the case of PHP talking to twemproxy, immediate reconnection would probably be a better policy. The one-second timeout is why we see floods of messages in bursts that last approximately one second each.

The fact that pmtpa apaches responding to pybal monitoring requests are heavily represented in the logs may be caused by transient packet loss on the pmtpa to eqiad link, which would increase the rate of connection timeouts.

Comment 5 Tim Starling 2013-11-13 22:35:38 UTC

Here's an aggregation of error messages in the current memcached-serious.log:

1  ITEM TOO BIG
2081  CONNECTION FAILURE
8199483  SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
9  SERVER ERROR
101  A TIMEOUT OCCURRED
3  A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE

If you exclude TIMED RETRY messages, you get only 27 different appservers, whereas there are 387 appservers in the log overall. CONNECTION FAILURE and A TIMEOUT OCCURRED can lead to a flood of TIMED RETRY messages afterwards, but obviously they can't account for most of the TIMED RETRY messages. In fact, all 2081 CONNECTION FAILURE messages in this log came from snapshot3. So we need to look for unlogged causes of servers being marked for timed retry.

Comment 6 Andre Klapper 2013-11-14 10:19:46 UTC

[Tim: Your browser seems to reset the Priority field.]

Comment 7 Tim Starling 2013-11-15 00:47:16 UTC

A potential workaround is to reduce the retry timeout to zero. I have written a libmemcached patch which allows this. Faidon will deploy it. The patch has been submitted upstream at <https://bugs.launchpad.net/libmemcached/+bug/1251482>

Comment 8 Tim Starling 2013-11-15 00:49:50 UTC

(In reply to comment #6)
> [Tim: Your browser seems to reset the Priority field.]

Maybe priority changes don't trigger the "mid-air collision" page.

Comment 9 Aaron Schulz 2014-02-14 02:35:28 UTC

I set the timeout to -1 on production...which is kind of hacky looking at the C code, but should work for now. Tim's upstream patch seems to be merged as well.

Comment 10 Gerrit Notification Bot 2014-02-14 19:05:09 UTC

Change 113387 had a related patch set uploaded by Aaron Schulz:
Set retry_timeout to -1 for memcached in eqiad only

https://gerrit.wikimedia.org/r/113387

Comment 11 Gerrit Notification Bot 2014-02-14 19:07:34 UTC

Change 113387 merged by jenkins-bot:
Set retry_timeout to -1 for memcached in eqiad only

https://gerrit.wikimedia.org/r/113387

Comment 12 Aaron Schulz 2014-02-14 19:50:20 UTC

(In reply to Gerrit Notification Bot from comment #10)
> Change 113387 had a related patch set uploaded by Aaron Schulz:
> Set retry_timeout to -1 for memcached in eqiad only
> 
> https://gerrit.wikimedia.org/r/113387

This doesn't work for tampa, so most errors will remain. I guess they can probably be ignored.

Comment 13 Andre Klapper 2014-02-25 15:41:42 UTC

Patch merged for Eqiad; somebody needs to decide if it's worth to do something about Tampa or not (and close the ticket accordingly).

Comment 14 Andre Klapper 2014-04-25 06:53:26 UTC

All patches merged; resetting ticket status

(In reply to Andre Klapper from comment #13)
> Patch merged for Eqiad; somebody needs to decide if it's worth to do
> something about Tampa or not (and close the ticket accordingly).

Comment 15 Sam Reed (reedy) 2014-04-28 15:56:22 UTC

Tampa is deaded

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links