Last modified: 2013-10-14 04:06:48 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T48428, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 46428 - mwscript.php/mctest.php does not know about memcache in both datacenters
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: wmf-deployment
Hardware: All
OS: All
Priority: High
Severity: critical
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Whiteboard: deploysprint-13
Keywords: platformeng
Depends on:
Blocks: 43421 46536
Reported: 2013-03-21 18:43 UTC by Leslie Carr
Modified: 2013-10-14 04:06 UTC (History)
19 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Leslie Carr 2013-03-21 18:43:03 UTC
(from RT #4790) hume does not know about the eqiad memcache cluster, at least when running mwscript.  Thus, the two memcache clusters get out of sync when running maintenance scripts on hume.



The real problem is that memcache maintenance is not location-aware. As we move to a multi-datacenter model, we need the option to look at one memcache instance or the other, or (even better) an option for all of them.

In my head it would be something like 

mwscript mctest.php --eqiad 
mwscript mctest.php --sdtpa
mwscript mctest.php --all
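A minimal sketch of what such location-aware flag handling could look like. This is in Python purely for illustration (the real implementation would live in MediaWiki's PHP maintenance wrappers), and the flag names, server addresses, and function name are all hypothetical:

```python
import argparse

# Hypothetical per-datacenter memcached server lists, purely for illustration;
# in production these would live in mediawiki-config/wmf-config/mc-*.php.
CLUSTERS = {
    "eqiad": ["10.64.0.1:11211", "10.64.0.2:11211"],
    "pmtpa": ["10.0.0.1:11211", "10.0.0.2:11211"],
}

def select_clusters(argv):
    """Parse --<dc>/--all flags and return the (name, servers) pairs to act on."""
    parser = argparse.ArgumentParser(prog="mctest")
    group = parser.add_mutually_exclusive_group(required=True)
    for name in CLUSTERS:
        group.add_argument("--" + name, action="store_true")
    group.add_argument("--all", action="store_true")
    opts = parser.parse_args(argv)
    if opts.all:
        # Act on every known datacenter.
        return sorted(CLUSTERS.items())
    # Otherwise, act only on the one datacenter that was flagged.
    return [(n, servers) for n, servers in CLUSTERS.items() if getattr(opts, n)]
```

For example, `select_clusters(["--eqiad"])` would target only the eqiad servers, while `--all` returns every cluster.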
Comment 1 Roan Kattouw 2013-03-21 18:47:28 UTC
This isn't just about mctest.php. Regular maintenance scripts that execute memcached insertions/deletions should also hit both locations.
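One way to make ordinary cache mutations location-safe is a wrapper that reads from the local cluster but broadcasts writes and deletions to every datacenter. The sketch below is only illustrative (a hypothetical MultiDCCache class in Python, not MediaWiki's actual cache layer):

```python
class MultiDCCache:
    """Illustrative sketch: reads hit the local cluster, but writes and
    deletions are broadcast to every datacenter, so a maintenance script
    cannot leave the clusters out of sync."""

    def __init__(self, local_dc, clients):
        self.local_dc = local_dc  # e.g. "eqiad"
        self.clients = clients    # dict: dc name -> client with get/set/delete

    def get(self, key):
        # Reads stay local for latency.
        return self.clients[self.local_dc].get(key)

    def set(self, key, value):
        # Mutations go everywhere.
        for client in self.clients.values():
            client.set(key, value)

    def delete(self, key):
        for client in self.clients.values():
            client.delete(key)
```

With a wrapper like this, the purge that Leslie describes in comment 4 would have reached both Tampa and eqiad regardless of where the script ran.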
Comment 2 Brion Vibber 2013-03-21 19:01:29 UTC
How're the memcache clusters arranged exactly? I'm worried this could lead to split-brain problems.
Comment 3 Leslie Carr 2013-03-21 19:13:28 UTC
They're configured here:
mediawiki-config/wmf-config/mc-eqiad.php
mediawiki-config/wmf-config/mc-pmtpa.php
https://wikitech.wikimedia.org/wiki/Memcached is actually up to date :)
Comment 4 Leslie Carr 2013-03-21 19:14:45 UTC
Oh, and this has also definitely affected things, with memcache being left unpurged for some fundraising purposes (which is how I discovered this). So Tampa and eqiad have different behavior, specifically for centralauth-user-05ef03b8cf2f9261df5f7f52c7ec7b65
Comment 5 Matthew Flaschen 2013-03-21 19:21:59 UTC
Is this also an issue for Redis?
Comment 6 Peter Gehres 2013-03-21 19:33:11 UTC
(In reply to comment #4)
> Oh and this also has definitely affected things, with memcache being unpurged
> for some fundraising purposes (how i discovered this). 

AFAIK, this does not currently affect fundraising, but I do know that fundraising has some plans for memcache, and I assume they would want the same functionality between the payments clusters.
Comment 7 Matt Walker 2013-03-22 05:52:58 UTC
Not exactly -- we will be using memcache for session storage and will continue our use of it for fraud analytics. In both cases, I'm not seeing a serious need for us to sync across clusters at this time -- we only ever have one active cluster, and if we lose messages in flight, my opinion is 'oh well, we can pick them up in the audit'.
Comment 8 Brion Vibber 2013-03-25 21:12:22 UTC
Ok, sounds like the proper fix is:

1) forbid running mwscript on fenari or elsewhere in pmtpa while eqiad is primary, because IT CAN CAUSE BREAKAGE TO DO SO (split-brain memcache, bad cached items possibly being reinserted to databases)

2) set up someplace in eqiad where mwscript can be run, so that maintenance scripts can be run WITHOUT BREAKAGE
Comment 9 Leslie Carr 2013-03-27 17:08:05 UTC
Brion - I have to disagree with you there.  

I think that in a multi-datacenter environment (which, even if we don't really have one, we should strive to have all our code pretend that we do), the location where you run a script should be unimportant (since we all know someone will mess up somehow), and the script should either know the location, prompt for it, or just clear from all locations.
Comment 10 Aaron Schulz 2013-03-27 17:10:06 UTC
(In reply to comment #9)
> Brion - I have to disagree with you there.  
> 
> I think that in a multidatacenter environment (which even if we don't really
> have it, we should strive to have all our code pretending that we do), the
> location where you run a script should be unimportant (since we all know
> someone will mess up somehow) and the script should either have the knowledge
> of the location or prompt for the input of location (or just clear from all
> locations).

As Brion suggested, it could be disallowed (the wrapper scripts could actually check for this and error out).
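A sketch of what such a wrapper-side guard could look like. The names here (the environment variable, the function, the hard-coded primary) are hypothetical, and Python is used purely for illustration since mwscript is really a shell/PHP wrapper:

```python
import os
import sys

PRIMARY_DC = "eqiad"  # hypothetical: in practice this would come from configuration

def require_primary_dc(local_dc=None):
    """Error out if a maintenance script is run outside the primary
    datacenter, to avoid split-brain cache writes. Illustrative guard only."""
    if local_dc is None:
        # Hypothetical environment variable identifying the local datacenter.
        local_dc = os.environ.get("WMF_DATACENTER", "")
    if local_dc != PRIMARY_DC:
        sys.exit("Refusing to run: local DC %r is not the primary DC %r."
                 % (local_dc, PRIMARY_DC))
    return True
```

Run at the top of the wrapper, this would have blocked the hume-in-pmtpa scenario from the original report while eqiad was primary.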
Comment 11 Fabrice Florin 2013-04-01 20:01:16 UTC
Hi folks, we'd be grateful if this issue could be resolved soon, as it is currently blocking the limited re-deployment of Article Feedback 5 on en-wiki, where the tool has now been completely unavailable for over two weeks. Is there anything Matthias and our E2 team could do to help reach a prompt resolution? Thanks in advance for any help you can provide toward that goal. :)
Comment 12 Terry Chay 2013-04-01 21:07:00 UTC
(In reply to comment #5)
> Is this also an issue for Redis?

I don't believe so. Redis should be multi-datacenter aware.
Comment 13 Andre Klapper 2013-04-02 16:26:50 UTC
Was told that this should probably wait until the Redis switchover, which will happen either this week or on Wed 10th of this month.
Comment 14 Fabrice Florin 2013-04-04 20:55:39 UTC
Thanks for the update, André. Do you know when we can realistically expect to solve this issue? Who would be responsible for making and deploying the necessary revision? Can we do anything to help make this happen sooner rather than later? I would like to tell the folks on English Wikipedia, who have been patiently waiting for the AFT5 tool to be re-enabled for nearly 3 weeks now, even though we promised it would be back up 2 weeks ago. :(
Comment 15 Aaron Schulz 2013-04-08 18:28:26 UTC
(In reply to comment #13)
> Was told that this should probably wait until the Redis switchover, which
> will
> happen either this week or on Wed 10th of this month.

What switchover? The idea of using Redis alone for caching was abandoned last year due to its LRU strategy. If you are referring to the job queue, I don't see those two as being related.

I've been trying to ping people about getting Terbium in a usable state. The directory permissions are still broken. It shouldn't be hard to get that working. At that point, MWScript will be usable, though the more convenient mwscript will still need some work since there is no /home directory.
Comment 16 Aaron Schulz 2013-04-08 19:44:02 UTC
MWScript.php and mwscript are working now after some permission fixes by Peter and after https://gerrit.wikimedia.org/r/#/c/58124/, so I think terbium is starting to get usable. I'll probably start running some scripts on it today myself.
Comment 17 Rob Lanphier 2013-04-09 19:03:27 UTC
Lowering priority of this bug, and unassigning from Aaron.  We're cooking up a plan to generally have a deployment tools-related sprint in another quarter or two, but right now, the urgent priority is getting migrated from hume to terbium, so that we stop trying to use hume for eqiad updates.

See https://www.mediawiki.org/wiki/Site_performance_and_architecture#Roadmap for more context.
Comment 18 Rob Halsell 2013-06-19 18:14:36 UTC
I'm not sure why this is being lowered; terbium coming online doesn't fix the fact that the script isn't location-aware.

This still needs to be fixed, and I would assume by development.

RobLa: Would anyone in dev over there handle making this location-aware? We need this for proper monitoring.
Comment 19 Tim Starling 2013-10-10 01:07:24 UTC
I don't understand why you would want a concept of per-DC memcached clusters in MediaWiki, when there is no replication. We're not planning on solving the split-brain problem at the application level. As far as I'm concerned, every actual bug was fixed in I5d64cec2 and Ib327d713 and so this can be closed.

(In reply to comment #9)
> I think that in a multidatacenter environment (which even if we don't really
> have it, we should strive to have all our code pretending that we do), the
> location where you run a script should be unimportant (since we all know
> someone will mess up somehow) and the script should either have the knowledge
> of the location or prompt for the input of location (or just clear from all
> locations).

Since I5d64cec2/Ib327d713, it doesn't matter where you run a script. So this is fixed, isn't it?
Comment 20 Tim Starling 2013-10-14 04:06:48 UTC
I will take your collective silence to mean yes.
