Last modified: 2014-04-28 15:56:22 UTC
memcached-serious.log on fluorine receives about half a million of these every hour: 2013-11-11 05:23:25 mw36 enwiki: Memcached error for key "enwiki:resourceloader:filter:minify-css:7:5fa71091f95043d13fcefbea197f4233" on server "127.0.0.1:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY It has been like that for several months now. It could be <https://bugs.php.net/bug.php?id=60049>: "Calling get with an invalid key and after that getMulti with a valid key does not work". This would explain why the keys that we're seeing in the log look sane; they are not the ones that trigger the failure. Could also be a twemproxy thing; it would be good to correlate the rate of errors with twemproxy deployment.
Possibly related to <https://bugs.launchpad.net/libmemcached/+bug/928696>
The keys "enwiki:newtalk:ip:10.0.0.14" and "enwiki:newtalk:ip:10.0.0.13" are preponderant in the logs and odd to behold. Possibly a different bug that is getting exposed by this one.
(In reply to comment #2) > The keys "enwiki:newtalk:ip:10.0.0.14" and "enwiki:newtalk:ip:10.0.0.13" are > preponderant in the logs and odd to behold. Possibly a different bug that is > getting exposed by this one. Those are LVS servers. Pybal is regularly requesting http://en.wikipedia.org/wiki/Main_Page on all backends, which is essentially the only MW traffic to the pmtpa apaches.
The error message indicates that a libmemcached memcached_connect() call gave a result of MEMCACHED_SERVER_TEMPORARILY_DISABLED. This can happen if memcached_mark_server_for_timeout() was called on the server, which can happen if memcached_quit_server is called with io_death=true, which can happen in all sorts of different cases in io.cc and in one case in response.cc. More investigation is needed to determine exactly which case is causing it. Unfortunately, MEMCACHED_BEHAVIOR_RETRY_TIMEOUT (i.e. retry_timeout in our config) has a minimum of one whole second. In the case of PHP talking to twemproxy, immediate reconnection would probably be a better policy. The one-second timeout is why we see floods of messages in bursts that last approximately one second each. The fact that pmtpa apaches responding to pybal monitoring requests are heavily represented in the logs may be caused by transient packet loss on the pmtpa to eqiad link, which would increase the rate of connection timeouts.
Here's an aggregation of error messages in the current memcached-serious.log: 1 ITEM TOO BIG 2081 CONNECTION FAILURE 8199483 SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY 9 SERVER ERROR 101 A TIMEOUT OCCURRED 3 A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE If you exclude TIMED RETRY messages, you get only 27 different appservers, whereas there are 387 appservers in the log overall. CONNECTION FAILURE and A TIMEOUT OCCURRED can lead to a flood of TIMED RETRY messages afterwards, but obviously they can't account for most of the TIMED RETRY messages. In fact, all 2081 CONNECTION FAILURE messages in this log came from snapshot3. So we need to look for unlogged causes of servers being marked for timed retry.
[Tim: Your browser seems to reset the Priority field.]
A potential workaround is to reduce the retry timeout to zero. I have written a libmemcached patch which allows this. Faidon will deploy it. The patch has been submitted upstream at <https://bugs.launchpad.net/libmemcached/+bug/1251482>
(In reply to comment #6) > [Tim: Your browser seems to reset the Priority field.] Maybe priority changes don't trigger the "mid-air collision" page.
I set the timeout to -1 on production...which is kind of hacky looking at the C code, but should work for now. Tim's upstream patch seems to be merged as well.
Change 113387 had a related patch set uploaded by Aaron Schulz: Set retry_timeout to -1 for memcached in eqiad only https://gerrit.wikimedia.org/r/113387
Change 113387 merged by jenkins-bot: Set retry_timeout to -1 for memcached in eqiad only https://gerrit.wikimedia.org/r/113387
(In reply to Gerrit Notification Bot from comment #10) > Change 113387 had a related patch set uploaded by Aaron Schulz: > Set retry_timeout to -1 for memcached in eqiad only > > https://gerrit.wikimedia.org/r/113387 This doesn't work for tampa, so most errors will remain. I guess they can probably be ignored.
Patch merged for Eqiad; somebody needs to decide if it's worth to do something about Tampa or not (and close the ticket accordingly).
All patches merged; resetting ticket status (In reply to Andre Klapper from comment #13) > Patch merged for Eqiad; somebody needs to decide if it's worth to do > something about Tampa or not (and close the ticket accordingly).
Tampa is deaded