Last modified: 2014-07-08 08:50:01 UTC
The Russian Wikinews community has noticed that stats.grok.se (which is built on the page view statistics from the Wikimedia dumps at <http://dumps.wikimedia.org/other/pagecounts-raw/>) shows an incorrect number of page views.
For example, take this page: https://ru.wikinews.org/wiki/Категория:Чемпионат_мира_по_футболу_2014/Статистика
Its statistics (<http://stats.grok.se/ru.n/latest/Категория:Чемпионат_мира_по_футболу_2014/Статистика>) show 6 views in the last 30 days, although the page has actually had many more views. Just look at the number of edits to the page: https://ru.wikinews.org/w/index.php?title=Категория:Чемпионат_мира_по_футболу_2014/Статистика&action=history
I know that stats.grok.se is an external tool, but it works on WMF's raw data, and it seems that the raw data is prepared incorrectly.
So is there an indicator that the issue is with Wikimedia's data, and not with stats.grok.se processing it?
(In reply to Andre Klapper from comment #1)
> So is there an indicator that the issue is with Wikimedia's data, and not
> with stats.grok.se processing it?
Given the file format, it would be very difficult for it to be incorrectly processed. My only comments on this would be:
* The pageview stats are based on URL matching, not the actual page, so depending on how the page was /reached/, pageviews may not appear.
* Direct comparisons with edit events aren't possible, because multiple edit events can be launched from a single pageview, and edit events themselves are excluded from the counter (wrong MIME type).
Aren't these stats based on sampling as well?
Actually I don't know; ErikZ can talk to that bit better than me. The large, aggregate breakdowns definitely are; I'm not sure if we do URL matching against those to produce the output here, or if the output here is based on the raw count.
(In reply to Oliver Keyes from comment #4)
> Actually I don't know; ErikZ can talk to that bit better than me. The large,
> aggregate breakdowns definitely are; I'm not sure if we do URL matching
> against those to produce the output here, or if the output here is based on
> the raw count.
Looking at some of the raw data - I see a lot of pages with 1 hit. If they were sampled I wouldn't expect that, so please ignore me :)
(In reply to Andre Klapper from comment #1)
> So is there an indicator that the issue is with Wikimedia's data, and not
> with stats.grok.se processing it?
stats.grok.se typically reads our files without problems. webstatscollector (the one producing those files) is more hairy. It has broken before, especially around non-Latin characters. I'll check the files.
However, the way things look ... I am not sure if something is broken. Given that we're measuring against edits and how webstatscollector is filtering, everything might just be fine^Wwithin expectations. Checking it nonetheless.
(In reply to Oliver Keyes from comment #2)
> *Direct comparisons with edit events isn't possible because multiple edit
> events can be launched from a single pageview, [...]
Right. And bots, for example, need not do a pageview (in the webstatscollector sense). They can edit right away.
> and edit events themselves
> are excluded from the counter (wrong MIME type)
webstatscollector does not care about MIME types; it counts requests regardless of MIME type. However, webstatscollector does care about "/wiki/" being in the URL. Edits are typically made through the API or directly through /w/index.php, neither of which has "/wiki/" in the URL, and hence they do not get counted by webstatscollector.
(In reply to Bawolff (Brian Wolff) from comment #3)
> Aren't these stats based on sampling as well?
It's one of the few parts that is unsampled :-) stats.grok.se is driven by http://dumps.wikimedia.org/other/pagecounts-raw/ which is the output of webstatscollector, which consumes the full unsampled firehose (well ... there is some packet loss).
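To illustrate the "/wiki/" rule from the comment above, here is a minimal Python sketch. It is not webstatscollector's actual code: the function name `is_counted` is made up for this example, and the real filter may apply further conditions that are not described in this thread.

```python
def is_counted(url: str) -> bool:
    """Approximate the rule described above: a request is only
    counted if "/wiki/" appears in its URL (hypothetical helper)."""
    return "/wiki/" in url


# A normal page view has "/wiki/" in the URL.
print(is_counted("https://ru.wikinews.org/wiki/Some_page"))
# History views and edits via /w/index.php or the API do not.
print(is_counted("https://ru.wikinews.org/w/index.php?title=Some_page&action=history"))
print(is_counted("https://ru.wikinews.org/w/api.php?action=edit"))
```

Under this rule the first URL would be counted and the other two would not, which is why the edit history is a poor yardstick for the page view numbers.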
While the page has some properties that would explain away some of the effects we're seeing, it turned out that since mid-April, SSL logs were no longer fed into webstatscollector (bug 67456), hence SSL traffic did not get counted on stats.grok.se.
Across all projects, SSL traffic does not account for much, but for ru.wikinews.org SSL traffic seems more relevant.
Seeing if I can find more things.
I could not find any hiccups in the counting pipeline other than bug 67456.
(In reply to christian from comment #7)
> While the page has some properties that would allow to explain away some
> effects we're seeing, it turned out that since mid-April, SSL logs were
> no longer fed into webstatscollector (bug 67456), hence SSL traffic did
> not get counted on stats.grok.se.
>
> Across all projects, SSL traffic does not account for too much, but
> for ru.wikinews.org SSL traffic seems more relevant.
>
---------------------------------------------
(In reply to christian from comment #7)
> Seeing if I can find more things.
I could not find more things.
Checking 7 consecutive days from the end of June, and adding the SSL page counts by hand, page views for this page went from ~1/day up to ~14/day, which is more plausible.
Also bear in mind that webstatscollector counts redirects only for the source of the redirect, not the target. So hits for https://ru.wikinews.org/wiki/Чемпионат_мира_по_футболу_2014/Статистика (which redirects to https://ru.wikinews.org/wiki/Категория:Чемпионат_мира_по_футболу_2014/Статистика ) are counted only at http://stats.grok.se/ru.n/latest30/Чемпионат_мира_по_футболу_2014/Статистика and already show considerably more page views.
But those numbers are going to increase further once SSL requests get fed into webstatscollector again.
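The hand check described above can be sketched roughly as follows in Python. This assumes the usual pagecounts-raw line layout of `project page_title view_count bytes` with percent-encoded titles. The function name `count_views` and all sample numbers are hypothetical and not taken from the real dumps or SSL logs.

```python
from urllib.parse import quote, unquote


def count_views(lines, project, title):
    """Sum the view counts for one page across pagecounts-style lines.

    Assumes each line is 'project page_title view_count bytes', with the
    title possibly percent-encoded. Lines that do not fit are skipped.
    """
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or not fields[2].isdigit():
            continue
        line_project, raw_title, views, _size = fields
        if line_project == project and unquote(raw_title) == title:
            total += int(views)
    return total


# Hypothetical sample data, made up for illustration only.
title = "Категория:Чемпионат_мира_по_футболу_2014/Статистика"
plain_lines = [f"ru.n {quote(title)} 1 20480", "ru.n Other_page 7 1000"]
ssl_lines = [f"ru.n {quote(title)} 13 266240"]

plain = count_views(plain_lines, "ru.n", title)
ssl = count_views(ssl_lines, "ru.n", title)
print(plain, ssl, plain + ssl)
```

Since redirects are counted at the redirect source, a fuller check would run the same function for the redirecting title as well and report both figures.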
SSL requests are being fed into webstatscollector again.