Last modified: 2014-02-06 12:47:26 UTC
Christian: I just noticed that the November directory of the pagecounts-ez/merged files at:
http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-11/
looks wrong. Many of the files there end in ".~" instead of ".bz2". The timestamps also differ from previous months. For example, each of the files in
http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-09/
was created on the day following the date in the file name.

P.S.: I noticed that the problem seems to have started in October:
http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-10/
There the 2013-10-24 file is not a ".bz2", but a ".~".

That date struck me. Although it's probably completely unrelated, we had (for the first time) a strange log line in the zero logs on that same day. There the timestamp of a log line has been mangled [1]. We're seeing such requests more and more these days.

[1]
___________________________________________________________
qchris@stat1002 // 0 // 20:18:05
cwd: ~
zcat /a/squid/archive/zero/zero.tsv.log-20131024 | cut -f 3 | grep -C 5 201cp3011
2013-10-23T13:29:23
2013-10-23T13:29:23
2013-10-23T13:29:23
2013-10-23T13:29:23
2013-10-23T13:29:23
201cp3011.esams.wikimedia.org
2013-10-23T13:29:24
2013-10-23T13:29:24
2013-10-23T13:29:24
2013-10-23T13:29:24
2013-10-23T13:29:24
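The grep in [1] only finds a value that is already known to be bad. A more general check would flag every row whose third field does not look like a timestamp. The following is a minimal sketch, not something that was run for this bug: the function name `find_mangled_timestamps`, the `YYYY-MM-DDTHH:MM:SS` pattern (inferred from the output in [1]), and the placeholder values in the other columns of the sample rows are all assumptions.

```python
import re

# Assumed timestamp shape, inferred from the sample output in [1].
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def find_mangled_timestamps(lines, field_index=2):
    """Return (line_number, value) for rows whose timestamp field is malformed.

    field_index=2 corresponds to `cut -f 3` (the third tab-separated field).
    """
    mangled = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Treat a missing field as an empty (and therefore malformed) value.
        value = fields[field_index] if len(fields) > field_index else ""
        if not TIMESTAMP_RE.fullmatch(value):
            mangled.append((line_number, value))
    return mangled


# Hypothetical rows: only the third field is taken from the log excerpt in [1].
sample = [
    "a\tb\t2013-10-23T13:29:23\tc\n",
    "a\tb\t201cp3011.esams.wikimedia.org\tc\n",
    "a\tb\t2013-10-23T13:29:24\tc\n",
]
print(find_mangled_timestamps(sample))
```

On a real log the same function could be fed the lines of the decompressed file, e.g. via `gzip.open(path, "rt")`.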
Prioritization and scheduling of this bug is tracked on Mingle card https://mingle.corp.wikimedia.org/projects/analytics/cards/1287
Any idea of the impact of this issue? Is it a problem? -Toby
Low impact. But I will fix it in the new year. In the meantime, monthly totals are extrapolated from the remaining days. And once it is fixed, all missing files can be recreated from permanently stored raw data.
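To make the extrapolation concrete, here is a minimal sketch that scales the total of the available days up to the full month. The exact method used for the monthly totals is not described in this thread, so the simple proportional scaling, the function name `extrapolate_monthly_total`, and the sample numbers are all assumptions.

```python
def extrapolate_monthly_total(daily_totals, days_in_month):
    """Estimate a monthly total from the days for which data exists.

    daily_totals maps day-of-month to that day's count; missing days are
    simply absent. Assumes missing days resemble the average available day.
    """
    if not daily_totals:
        raise ValueError("need at least one day of data to extrapolate")
    available_sum = sum(daily_totals.values())
    return available_sum * days_in_month / len(daily_totals)


# Hypothetical counts: day 3 is missing from a 30-day month.
estimate = extrapolate_monthly_total({1: 100, 2: 120, 4: 110}, days_in_month=30)
print(estimate)
```

Such an estimate is only as good as the assumption that the missing days look like the available ones, which is why recreating the files from raw data remains the proper fix.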
It certainly is a problem for me; I have used those files several times. (E.g. to understand the data we're seeing, to understand webstatscollector, to understand pageviews.) Of course, I can run the aggregations myself when needed, but that means a huge delay and a waste of time :-/ Besides, it is a public set of daily data that has not been updated for ~1.5 months :-(
Not even 2 hours past comment #4 and I would already have needed the data again :-) I've just been pointed towards bug #58316. As we do not see the \x hits in the sampled logs, I would naturally use Erik's merged files to see if the problem is webstatscollector related. Falling back to doing it by hand. Meh.
Sorry, I did not realize you use it that often. I will look at it in the coming days.
(In reply to comment #6)
> Sorry, I did not realize you use it that often. I will look at in the coming
> days.

Sorry, my point was not to mess with your scheduling. Not at all! I just wanted to show that the data indeed gets used. It's perfectly fine by me if we fix it early 2014.
The daily files come in as expected again.