Last modified: 2014-09-03 18:32:05 UTC
We should figure out exactly what is going on. For example, I saw this:

2014-03-07 02:40:46 mw1015 wikidatawiki: Update for doc ids: 15087781
2014-03-07 02:40:46 mw1008 wikidatawiki: Update for doc ids: 15087781
2014-03-07 02:40:46 mw1008 wikidatawiki: Update for doc ids: 15087781

These might just be concurrent attempts to update the same document. Chad was talking about just retrying here. If we're updating the same document multiple times, retrying would work. We might also want to use the pool counter to prevent it: we could probably use a shared acquire to notice that another job is already updating the document and, well, not do it. One problem, though, is that updating a page takes a parser lock (I think), so we have to make sure the jobs don't lock each other out. I think.
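The retry idea above can be sketched as optimistic concurrency control: re-read the document's version and retry the write when another job got there first. This is a minimal illustrative sketch with a toy in-memory store; `IndexClient` and `VersionConflictError` are stand-ins, not the real Elasticsearch client API.

```python
class VersionConflictError(Exception):
    """Raised when a write carries a stale version (stand-in for ES's 409)."""


class IndexClient:
    """Toy versioned document store that rejects stale writes."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        return self.docs.get(doc_id, (0, None))

    def update(self, doc_id, body, expected_version):
        version, _ = self.docs.get(doc_id, (0, None))
        if version != expected_version:
            raise VersionConflictError(doc_id)
        self.docs[doc_id] = (version + 1, body)


def update_with_retry(client, doc_id, body, max_retries=3):
    """Re-read and retry when a concurrent job updated the same doc."""
    for _ in range(max_retries):
        version, _ = client.get(doc_id)
        try:
            client.update(doc_id, body, expected_version=version)
            return True
        except VersionConflictError:
            continue  # another job won the race; re-read and try again
    return False
```

With this shape, two jobs racing on doc id 15087781 both succeed: the loser of the race simply re-reads the new version and writes again instead of failing with a version conflict.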
We do retry now as of Gerrit change #117335.
Cool! I guess I was just going on faith that we were retrying...
I just checked and saw two things:
1. If we try to send 50 updates all at once we might bump against an Elasticsearch queue limit. I'm chunking them to 10 at a time.
2. I _think_ moving a page and leaving behind a redirect can cause that version conflict error. I believe it creates two jobs - one for the new page and one for the redirect. We need to keep the redirect job but might be able to throw away the one for the new page. Worth checking.
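The chunking in point #1 is just splitting a batch into groups of at most 10 before sending, so no single request bumps against the queue limit. A minimal sketch (the chunk size of 10 comes from the comment above; the send step is elided):

```python
def chunk(updates, size=10):
    """Split a batch of updates into chunks of at most `size` items."""
    return [updates[i:i + size] for i in range(0, len(updates), size)]


# Usage: instead of sending all 50 updates in one request,
# send five requests of 10 updates each.
for batch in chunk(list(range(50))):
    pass  # send `batch` to Elasticsearch here
```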
Change 148405 had a related patch set uploaded by Manybubbles: Chunk updates at 10 https://gerrit.wikimedia.org/r/148405
Finishing up the work to skip the second update described in point #2 of comment 3.
Change 148417 had a related patch set uploaded by Manybubbles: On article move only use one job https://gerrit.wikimedia.org/r/148417
Change 148405 merged by jenkins-bot: Chunk updates at 10 https://gerrit.wikimedia.org/r/148405
Change 148417 merged by jenkins-bot: On article move only use one job https://gerrit.wikimedia.org/r/148417
Shifting back to new - we'll have to reevaluate in two weeks or so once these changes hit production and we've churned through the queue.
Dug into these and the vast majority now come from running the same update two or three times concurrently. Noop detection, going out to the Wikipedias tomorrow, should squash most of these.
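The noop detection mentioned above can be sketched as: before writing, compare the new document against what is already indexed and skip the write when nothing changed, so duplicate jobs for the same edit stop racing each other. This is an illustrative sketch only; `should_update` and the fingerprinting scheme are assumptions, not the actual CirrusSearch implementation.

```python
import hashlib
import json


def doc_fingerprint(doc):
    """Stable hash of a document body (illustrative choice of scheme)."""
    return hashlib.sha1(json.dumps(doc, sort_keys=True).encode()).hexdigest()


def should_update(indexed_doc, new_doc):
    """Noop detection: skip the write when the indexed doc already matches."""
    if indexed_doc is None:
        return True  # nothing indexed yet; always write
    return doc_fingerprint(indexed_doc) != doc_fingerprint(new_doc)
```

A duplicate job arriving after the first one has written sees identical content and becomes a noop, so it never issues the conflicting update at all.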