Last modified: 2014-04-28 15:56:22 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T58882, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 56882 - memcached-serious log flooded with TIMED RETRY errors
memcached-serious log flooded with TIMED RETRY errors
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
General/Unknown (Other open bugs)
wmf-deployment
All All
: High normal (vote)
: ---
Assigned To: Tim Starling
: platformeng
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-11-11 05:28 UTC by Ori Livneh
Modified: 2014-04-28 15:56 UTC (History)
5 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Ori Livneh 2013-11-11 05:28:01 UTC
memcached-serious.log on fluorine receives about half a million of these every hour:

2013-11-11 05:23:25 mw36 enwiki: Memcached error for key "enwiki:resourceloader:filter:minify-css:7:5fa71091f95043d13fcefbea197f4233" on server "127.0.0.1:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY

It has been like that for several months now. It could be <https://bugs.php.net/bug.php?id=60049>: "Calling get with an invalid key and after that getMulti with a valid key does not work". This would explain why the keys that we're seeing in the log look sane; they are not the ones that trigger the failure. Could also be a twemproxy thing; it would be good to correlate the rate of errors with twemproxy deployment.
Comment 1 Ori Livneh 2013-11-11 06:00:25 UTC
Possibly related to <https://bugs.launchpad.net/libmemcached/+bug/928696>
Comment 2 Ori Livneh 2013-11-11 06:40:13 UTC
The keys "enwiki:newtalk:ip:10.0.0.14" and "enwiki:newtalk:ip:10.0.0.13" are preponderant in the logs and odd to behold. Possibly a different bug that is getting exposed by this one.
Comment 3 Tim Starling 2013-11-13 03:38:17 UTC
(In reply to comment #2)
> The keys "enwiki:newtalk:ip:10.0.0.14" and "enwiki:newtalk:ip:10.0.0.13" are
> preponderant in the logs and odd to behold. Possibly a different bug that is
> getting exposed by this one.

Those are LVS servers. Pybal is regularly requesting http://en.wikipedia.org/wiki/Main_Page on all backends, which is essentially the only MW traffic to the pmtpa apaches.
Comment 4 Tim Starling 2013-11-13 06:59:53 UTC
The error message indicates that a libmemcached memcached_connect() call gave a result of MEMCACHED_SERVER_TEMPORARILY_DISABLED. This can happen if memcached_mark_server_for_timeout() was called on the server, which can happen if memcached_quit_server is called with io_death=true, which can happen in all sorts of different cases in io.cc and in one case in response.cc. More investigation is needed to determine exactly which case is causing it. 

Unfortunately, MEMCACHED_BEHAVIOR_RETRY_TIMEOUT (i.e. retry_timeout in our config) has a minimum of one whole second. In the case of PHP talking to twemproxy, immediate reconnection would probably be a better policy. The one-second timeout is why we see floods of messages in bursts that last approximately one second each.

The fact that pmtpa apaches responding to pybal monitoring requests are heavily represented in the logs may be caused by transient packet loss on the pmtpa to eqiad link, which would increase the rate of connection timeouts.
Comment 5 Tim Starling 2013-11-13 22:35:38 UTC
Here's an aggregation of error messages in the current memcached-serious.log:

1  ITEM TOO BIG
2081  CONNECTION FAILURE
8199483  SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
9  SERVER ERROR
101  A TIMEOUT OCCURRED
3  A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE

If you exclude TIMED RETRY messages, you get only 27 different appservers, whereas there are 387 appservers in the log overall. CONNECTION FAILURE and A TIMEOUT OCCURRED can lead to a flood of TIMED RETRY messages afterwards, but obviously they can't account for most of the TIMED RETRY messages. In fact, all 2081 CONNECTION FAILURE messages in this log came from snapshot3. So we need to look for unlogged causes of servers being marked for timed retry.
Comment 6 Andre Klapper 2013-11-14 10:19:46 UTC
[Tim: Your browser seems to reset the Priority field.]
Comment 7 Tim Starling 2013-11-15 00:47:16 UTC
A potential workaround is to reduce the retry timeout to zero. I have written a libmemcached patch which allows this. Faidon will deploy it. The patch has been submitted upstream at <https://bugs.launchpad.net/libmemcached/+bug/1251482>
Comment 8 Tim Starling 2013-11-15 00:49:50 UTC
(In reply to comment #6)
> [Tim: Your browser seems to reset the Priority field.]

Maybe priority changes don't trigger the "mid-air collision" page.
Comment 9 Aaron Schulz 2014-02-14 02:35:28 UTC
I set the timeout to -1 on production...which is kind of hacky looking at the C code, but should work for now. Tim's upstream patch seems to be merged as well.
Comment 10 Gerrit Notification Bot 2014-02-14 19:05:09 UTC
Change 113387 had a related patch set uploaded by Aaron Schulz:
Set retry_timeout to -1 for memcached in eqiad only

https://gerrit.wikimedia.org/r/113387
Comment 11 Gerrit Notification Bot 2014-02-14 19:07:34 UTC
Change 113387 merged by jenkins-bot:
Set retry_timeout to -1 for memcached in eqiad only

https://gerrit.wikimedia.org/r/113387
Comment 12 Aaron Schulz 2014-02-14 19:50:20 UTC
(In reply to Gerrit Notification Bot from comment #10)
> Change 113387 had a related patch set uploaded by Aaron Schulz:
> Set retry_timeout to -1 for memcached in eqiad only
> 
> https://gerrit.wikimedia.org/r/113387

This doesn't work for tampa, so most errors will remain. I guess they can probably be ignored.
Comment 13 Andre Klapper 2014-02-25 15:41:42 UTC
Patch merged for Eqiad; somebody needs to decide if it's worth to do something about Tampa or not (and close the ticket accordingly).
Comment 14 Andre Klapper 2014-04-25 06:53:26 UTC
All patches merged; resetting ticket status

(In reply to Andre Klapper from comment #13)
> Patch merged for Eqiad; somebody needs to decide if it's worth to do
> something about Tampa or not (and close the ticket accordingly).
Comment 15 Sam Reed (reedy) 2014-04-28 15:56:22 UTC
Tampa is deaded

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links