Last modified: 2013-10-14 04:06:48 UTC
(from RT #4790) hume does not know about the eqiad memcache cluster, at least when running mwscript. Thus, the two memcache clusters get out of sync when running maintenance scripts on hume. The real problem is that memcache maintenance is not location aware. As we move into a multi-datacenter model, we need the option to target one memcache instance or the other, or (even better) an option for all of them. In my head it would be something like:

mwscript mctest.php --eqiad
mwscript mctest.php --sdtpa
mwscript mctest.php --all
This isn't just about mctest.php. Regular maintenance scripts that perform memcached insertions/deletions should also hit both locations.
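To illustrate the idea, here is a minimal sketch of what "hit both locations" could mean for purges. This is hypothetical, not actual MediaWiki code: the class, pool names, and dict-backed stand-ins for memcached clients are all invented for illustration.

```python
# Hypothetical sketch: fan out cache purges to every datacenter's
# memcached pool, not just the local one. Plain dicts stand in for
# real memcached clients here.

class MultiDCCache:
    def __init__(self, pools):
        # pools: dict mapping DC name -> cache client (here: plain dicts)
        self.pools = pools

    def set(self, dc, key, value):
        # Normal writes go to a single DC's pool.
        self.pools[dc][key] = value

    def delete_everywhere(self, key):
        # Purges must hit all DCs, or the clusters drift out of sync
        # (which is exactly the stale-key symptom described above).
        for pool in self.pools.values():
            pool.pop(key, None)


cache = MultiDCCache({"eqiad": {}, "sdtpa": {}})
cache.set("eqiad", "centralauth-user-x", "stale")
cache.set("sdtpa", "centralauth-user-x", "stale")
cache.delete_everywhere("centralauth-user-x")
```

A maintenance script that only deleted from its local pool would leave the other DC serving the stale value, which is the drift being reported here.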
How are the memcache clusters arranged, exactly? I'm worried this could lead to split-brain problems.
They're configured here: mediawiki-config/wmf-config/mc-eqiad.php and mediawiki-config/wmf-config/mc-pmtpa.php. https://wikitech.wikimedia.org/wiki/Memcached is actually up to date :)
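For readers without access to those files: the per-DC configuration amounts to a separate server list per datacenter. A rough sketch of the shape of that data (server addresses invented for illustration; the real files are PHP, this is not their actual contents):

```python
# Hypothetical sketch of per-datacenter memcached server lists.
# Addresses are invented; the real config lives in mc-eqiad.php
# and mc-pmtpa.php in mediawiki-config.
MEMCACHED_POOLS = {
    "eqiad": ["10.64.0.180:11211", "10.64.0.181:11211"],
    "pmtpa": ["10.0.2.101:11211", "10.0.2.102:11211"],
}

def servers_for(dc):
    """Return the memcached server list for a given datacenter."""
    return MEMCACHED_POOLS[dc]
```

Since the pools are entirely separate lists with no replication between them, nothing keeps the two sides consistent except the clients writing to both.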
Oh, and this has also definitely affected things, with memcache going unpurged for some fundraising purposes (which is how I discovered this). So Tampa and eqiad have different behavior specifically for centralauth-user-05ef03b8cf2f9261df5f7f52c7ec7b65
Is this also an issue for Redis?
(In reply to comment #4)
> Oh and this also has definitely affected things, with memcache being unpurged
> for some fundraising purposes (how i discovered this).

AFAIK, this does not affect fundraising currently, but I do know that fundraising has some plans for memcache, and I assume they would want the same functionality between the payments clusters.
Not exactly -- we will be using memcache for session storage and will continue our use of it for fraud analytics. In either case, I'm not seeing a serious need for us to sync across clusters at this time -- we only ever have one active cluster, and if we lose messages in flight my opinion is 'oh well, we can pick them up in the audit'.
Ok, sounds like the proper fix is:

1) forbid running mwscript on fenari or elsewhere in pmtpa while eqiad is primary, because IT CAN CAUSE BREAKAGE TO DO SO (split-brain memcache, bad cached items possibly being reinserted into databases)
2) set up someplace in eqiad where mwscript can be run, so that maintenance scripts can be run WITHOUT BREAKAGE
Brion - I have to disagree with you there. I think that in a multi-datacenter environment (which, even if we don't really have one yet, we should strive to write all our code as if we do), the location where you run a script should be unimportant (since we all know someone will mess up somehow), and the script should either have knowledge of the location, prompt for the location as input, or just clear from all locations.
(In reply to comment #9)
> Brion - I have to disagree with you there.
>
> I think that in a multidatacenter environment (which even if we don't really
> have it, we should strive to have all our code pretending that we do), the
> location where you run a script should be unimportant (since we all know
> someone will mess up somehow) and the script should either have the knowledge
> of the location or prompt for the input of location (or just clear from all
> locations).

As brion suggested, it could be disallowed (the wrapper scripts could actually check for this and error out).
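Such a wrapper guard could amount to comparing the local datacenter against the primary one before running anything. A hypothetical sketch (the function name, arguments, and message are invented; this is not the actual mwscript wrapper):

```python
# Hypothetical sketch of the guard a wrapper script could perform:
# refuse to run maintenance scripts from a non-primary datacenter,
# since doing so can reintroduce split-brain cache state.

def check_allowed(local_dc, primary_dc):
    """Return None if running here is safe, else an error message."""
    if local_dc != primary_dc:
        return ("Refusing to run: {} is not the primary datacenter "
                "({}); running maintenance scripts here can cause "
                "split-brain memcache state.".format(local_dc, primary_dc))
    return None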
Hi folks, we'd be grateful if this issue could be resolved soon, as it is currently blocking the limited re-deployment of Article Feedback 5 on en-wiki, where the tool has now been completely unavailable for over two weeks. Is there anything which Matthias and our E2 team could do to help a prompt resolution? Thanks in advance for any help you can provide towards that goal. :)
(In reply to comment #5)
> Is this also an issue for Redis?

I don't believe so. Redis should be multi-datacenter aware.
Was told that this should probably wait until the Redis switchover, which will happen either this week or on Wed 10th of this month.
Thanks for the update, André. Do you know when we can realistically expect to solve this issue? Who would be responsible for making and deploying the necessary revision? Can we do anything to help make this happen sooner rather than later? I would like to give an update to the folks on English Wikipedia, who have been patiently waiting for the AFT5 tool to be re-enabled for nearly 3 weeks now, even though we promised it would be back up 2 weeks ago. :(
(In reply to comment #13)
> Was told that this should probably wait until the Redis switchover, which will
> happen either this week or on Wed 10th of this month.

What switchover? The idea of using Redis for caching was abandoned last year due to its LRU strategy. If you are referring to the job queue, I don't see those two as being related.

I've been trying to ping people about getting Terbium in a usable state. The directory permissions are still broken. It shouldn't be hard to get that working. At that point, MWScript will be usable, though the more convenient mwscript will still need some work since there is no /home directory.
MWScript.php and mwscript are working now after some permission fixes by Peter and after https://gerrit.wikimedia.org/r/#/c/58124/, so I think terbium is starting to get usable. I'll probably start running some scripts on it today myself.
Lowering priority of this bug, and unassigning from Aaron. We're cooking up a plan to generally have a deployment tools-related sprint in another quarter or two, but right now, the urgent priority is getting migrated from hume to terbium, so that we stop trying to use hume for eqiad updates. See https://www.mediawiki.org/wiki/Site_performance_and_architecture#Roadmap for more context.
I'm not sure why this is being lowered; terbium coming online doesn't fix the fact that the scripts aren't location aware. This still needs to be fixed, and I would assume by development. RobLa: would anyone in dev over there handle making this location aware? We need this for proper monitoring.
I don't understand why you would want a concept of per-DC memcached clusters in MediaWiki when there is no replication. We're not planning on solving the split-brain problem at the application level. As far as I'm concerned, every actual bug was fixed in I5d64cec2 and Ib327d713, and so this can be closed.

(In reply to comment #9)
> I think that in a multidatacenter environment (which even if we don't really
> have it, we should strive to have all our code pretending that we do), the
> location where you run a script should be unimportant (since we all know
> someone will mess up somehow) and the script should either have the knowledge
> of the location or prompt for the input of location (or just clear from all
> locations).

Since I5d64cec2/Ib327d713, it doesn't matter where you run a script. So this is fixed, isn't it?
I will take your collective silence to mean yes.