Last modified: 2014-09-02 13:17:27 UTC
TL;DR: For the files between 2014-08-24 14:00 and 2014-08-27 21:00 webstatscollector output are showing some irregularities. This affects both the pagecounts and projectcount files from https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/ and all services that process them. -------------------------------------------- The pagecount file 20140824 14:00 [1] did not show irregularities. But the file for one day later show a drop of 80% for a few pages. As gadolinium (the host that is writing the pagecount files) showed a high and still increasing process count for no longer needed services (bug 70053), those services got turned off around 2014-08-26 19:00:00. But although more resources were freed on gadolinium, its webstatscollector's collector process degraded further. Since the service did not recover (UDP Receive Buffers filled up again and again, disks could not take the write load, and the service gathered 95GB of virtual memory since its last restart), the service got restarted on 2014-08-28 ~15:32. Since the restart did not relax the situation either, the service was put on tmpfs 2014-08-28 ~19:48, which reduced load on the disks, and made the service work again. The files starting at 2014-08-27 21:00 are good again. Thanks ottomata for all those fixes! More details are in the corresponding IRC channel logs [3]. Closer investigation of the files between 20140824 14:00 and 2014-08-27 21:00 is still pending. [1] https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140824-140000.gz [2] https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140827-210000.gz [3] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140827.txt
(In reply to christian from comment #0) > [...] restarted on 2014-08-28 ~15:32. That timestamp is meant to read "2014-08-27 ~15:32." (one day before). > [...] put on tmpfs 2014-08-28 ~19:48, which [...] That timestamp is meant to read "2014-08-27 ~19:48" (one day before).
Any ideas as to the root cause? Do you think this has been an ongoing problem that we haven't noticed?
(In reply to Toby Negrin from comment #2) > Any ideas as to the root cause? Yes. Bug 70053. Due to the increasing number of processes, less and less resources were available for other processes, and at some point it started to affect other services like webstatscollector. > Do you think this has been an ongoing > problem that we haven't noticed? We have no official monitoring of webstatscollector. So we have no way of telling. Hence, we also have no way of being alerted about such issues in the first place. It would be a nice quarterly goal to have proper, official monitoring in place ;-) That would allow us to answer such questions. Mhmm ... that said ... no it's not an ongoing problem. Data up to 2014-08-24 14:00 is within the usual expectations.
(In reply to christian from comment #0) > Closer investigation of the files between 20140824 14:00 and > 2014-08-27 21:00 is still pending. The ending of the interval is crisp: The last bad file is: https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140827-200000.gz The first file that's good again is: https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140827-210000.gz But due to the lack of granular monitoring, and hard boundaries, finding a crisp beginning of the interval of bad data did not work out. The last known to be good file is: https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140824-140000.gz The first known to be bad file is: https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140825-140000.gz