Last modified: 2013-04-22 16:51:41 UTC
We're logging around 6000-10000 job queue OOMs per day:

$ for day in `seq 15 17`; do echo -n "August $day: "; zgrep -A2 'Allowed memory size of' fatal.log-201208$day.gz | grep unknown-host | wc -l ; done
August 15: 7239
August 16: 9737
August 17: 6492

They are OOMs from various points in the parser, with RefreshLinksJob2::run() as the ultimate caller. These cause collateral damage beyond the article that actually triggered the OOM, since the whole RefreshLinks2 batch is lost. Perhaps there is a memory leak.
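For anyone wanting to break these down further: a rough way to see which parser entry points the jobs die in is to pull the Class::method() frames out of the lines following the fatal message. This is only a sketch; it assumes the fatal log prints a backtrace frame within the two lines after 'Allowed memory size of' (adjust the -A window and pattern to the actual log format):

$ zgrep -A2 'Allowed memory size of' fatal.log-20120815.gz | grep -oP '\w+(::\w+)?\(\)' | sort | uniq -c | sort -rn | head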
Started looking at this a bit. Also, see https://gerrit.wikimedia.org/r/22497.
After speaking with Aaron just now, it seems this one may not be a problem anymore. Tim, does this still look like a problem to you?
Actually it looks just as frequent as before.
OK, more info:

aaron@fluorine:~/mw-log$ for day in `seq 25 30`; do echo -n "Sep $day: "; zgrep -A2 'Allowed memory size of' archive/fatal.log-201209$day.gz | grep unknown-host | wc -l ; done
Sep 25: 1359
Sep 26: 979
Sep 27: 823
Sep 28: 812
Sep 29: 769
Sep 30: 970
Seems lower the last few weeks.

aaron@fluorine:~/mw-log$ for day in `seq 25 31`; do echo -n "Dec $day: "; zgrep -A2 'Allowed memory size of' archive/fatal.log-201212$day.gz | grep unknown-host | wc -l ; done
Dec 25: 11
Dec 26: 11
Dec 27: 239
Dec 28: 46
Dec 29: 3
Dec 30: 4
Dec 31: 1
The job runner memory limits were doubled and the Wikidata job batch sizes were also halved (again) on Apr 16 (those were piling up OOMs of their own).
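For context, the runner-side knob looks roughly like this; the value shown is illustrative, not the exact change (runJobs.php accepts the standard maintenance-script --memory-limit option):

# Raise the per-process PHP memory limit for a job runner invocation
# (300M is a placeholder, not the deployed value):
$ php maintenance/runJobs.php --type refreshLinks2 --memory-limit 300M

Halving the Wikidata batch sizes helps for the same reason noted above: each batch touches fewer titles, so a single runaway parse wastes less work when it OOMs.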
aaron@fluorine:~/mw-log$ for day in `seq 10 21`; do echo -n "Day $day: "; zgrep -A2 'Allowed memory size of' archive/fatal.log-201304$day.gz | grep -P "mw10(0[1-9]|1[0-6])" | wc -l ; done
Day 10: 6665
Day 11: 16169
Day 12: 29571
Day 13: 1879
Day 14: 142
Day 15: 6
Day 16: 141
Day 17: 27
Day 18: 0
Day 19: 0
Day 20: 0
Day 21: 0