Last modified: 2013-12-04 18:46:19 UTC
Somehow I have the effect that after importing pages from another wiki with Special:Import, my Apache and MySQL split the CPU 50:50 and only work normally again after restarting MySQL. Running strace on the Apache process that sits at 50% CPU always shows the same thing:

  poll([{fd=82, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
  write(82, "\370\0\0\0\3UPDATE /* JobQueueDB::claim"..., 252) = 252
  read(82, "0\0\0\1\0\0\0\3\0\0\0(Rows matched: 0 Cha"..., 16384) = 52

and strace on MySQL shows this (not very helpful, I think):

  getsockname(33, {sa_family=AF_FILE, path="/var/run/mysql"}, [30]) = 0
  fcntl(33, F_SETFL, O_RDONLY) = 0
  fcntl(33, F_GETFL) = 0x2 (flags O_RDWR)
  fcntl(33, F_SETFL, O_RDWR|O_NONBLOCK) = 0
  setsockopt(33, SOL_IP, IP_TOS, [8], 4) = -1 EOPNOTSUPP (Operation not supported)
  futex(0x2b202ca082a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2b202ca082a0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
  futex(0x2b202ca076e0, FUTEX_WAKE_PRIVATE, 1) = 1
  select(13, [10 12], NULL, NULL, NULL) = 1 (in [12])
  fcntl(12, F_SETFL, O_RDWR|O_NONBLOCK) = 0
  accept(12, {sa_family=AF_FILE, NULL}, [2]) = 33
  fcntl(12, F_SETFL, O_RDWR)

It seems to me that something is going wrong with the job queue.
Sometimes I also get this error in "JobQueueDB::doAck": the database reported the error "1205: Lock wait timeout exceeded; try restarting transaction (localhost)". But it hides on a "second" page right below the normal wiki page. I think these two errors somehow belong to each other.
Tentatively moving this to JobQueue component.
The funny thing is that looking into the job queue with maintenance/showJobs.php shows 0, while the Apache process produces hundreds of these per second:

  poll([{fd=82, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
  write(82, "\370\0\0\0\3UPDATE /* JobQueueDB::claim"..., 252) = 252
  read(82, "0\0\0\1\0\0\0\3\0\0\0(Rows matched: 0 Cha"..., 16384) = 52
Does anybody have an idea about this? I'm losing money because my wiki isn't running correctly. I have limited MySQL to 30% CPU with cpulimit, but this also kills the wiki completely. It can't be right that MySQL uses 100% CPU for hours just because I imported a wiki page.
Is showJobs.php always empty or almost empty? Did you do anything to $wgJobTypeConf? JobQueueDB::claim doesn't get called unless a prior SELECT found a job. If $wgJobRunRate is high and there are very many page requests but only a few jobs (at least some, though), you could perhaps see something like this occasionally, but I'd be a bit skeptical about that. Maybe claimRandom() was in a tight loop, but I don't see how that's possible either. You can try setting $wgJobRunRate = 0 and daemonizing maintenance/runJobs.php to run in the background instead of running jobs on random page requests, roughly as sketched below.
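For what it's worth, a minimal sketch of that setup; the path, batch size, and the idea of driving it from cron are my own illustrative assumptions, not something prescribed by this bug:

  // LocalSettings.php: stop running jobs on random page requests
  $wgJobRunRate = 0;

  // Then drain the queue out-of-band, e.g. from cron or a shell loop
  // (illustrative; --maxjobs bounds each batch so a single run
  // cannot monopolize the server):
  //   php /path/to/wiki/maintenance/runJobs.php --maxjobs 1000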
My job queue has 20k entries. I set $wgJobRunRate = 0.1; maybe this helps. But with the visits I get, this would take 3 years. And I'm afraid of killing my server when running runJobs.php.
Do you have any long-running transactions (for example from scripts doing queries in the background that take hours or days)? These can leave many deleted-but-not-purged rows in MySQL, which can make the queue unusable.
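For illustration, on MySQL 5.5+ one way to look for such transactions is to query information_schema.INNODB_TRX. This is only a sketch; the credentials and the 60-second threshold are placeholder assumptions, and SHOW FULL PROCESSLIST or SHOW ENGINE INNODB STATUS give a similar picture:

  <?php
  // Sketch: list InnoDB transactions that have been open longer than
  // 60 seconds. Host, user, and password are placeholders.
  $db = new mysqli('localhost', 'root', 'password');
  if ($db->connect_error) {
      die("Connect failed: {$db->connect_error}\n");
  }
  $res = $db->query(
      "SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query
       FROM information_schema.INNODB_TRX
       WHERE trx_started < NOW() - INTERVAL 60 SECOND"
  );
  while ($row = $res->fetch_assoc()) {
      printf("trx %s started %s (thread %s): %s\n",
          $row['trx_id'], $row['trx_started'],
          $row['trx_mysql_thread_id'], $row['trx_query']);
  }

A transaction that sits open for hours keeps InnoDB from purging old row versions, which is what makes the job table slow.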
What are long-running transactions? Where can I check this? Yesterday at around 22:00 UTC I made an import; about 12 hours later the database ran up to 30% CPU (the limit I set) and stayed at that level for 10 hours, which made my wiki unavailable for 10 hours! This is a really nasty problem, because it kills my Google and Bing rankings and makes me lose money. My $wgJobRunRate is at 0.1 and I have 20,000 jobs. What can I do to investigate this problem? I really need to solve this; it's the third time this month and it's really annoying.
https://gerrit.wikimedia.org/r/#/c/63819/ might help if jobs are still run on page views.
Any more on this?
DaSch: Is this still an issue? Did the patch in comment 9 help?
This is merged already. I haven't experienced any problems with the last imports, so this seems to help.
Thanks. Closing as FIXED as per last comment. Please reopen if this happens again.