Last modified: 2014-07-08 08:50:01 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T69411, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 67411 - page view statistics for Wikinews seem to be wrong
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: christian
Whiteboard: u=Community c=General/Unknown p=0 s=2...
Depends on: 67456
Blocks:
Reported: 2014-07-02 11:31 UTC by Rubin16
Modified: 2014-07-08 08:50 UTC

Description Rubin16 2014-07-02 11:31:26 UTC
The Russian Wikinews community noticed that stats.grok.se (which builds page view statistics from the Wikimedia dumps at <http://dumps.wikimedia.org/other/pagecounts-raw/>) shows an incorrect number of page views.

For example, take this page:
https://ru.wikinews.org/wiki/Категория:Чемпионат_мира_по_футболу_2014/Статистика

If we open its statistics (<http://stats.grok.se/ru.n/latest/Категория:Чемпионат_мира_по_футболу_2014/Статистика>), we see 6 views in the last 30 days, though the page actually had many more views; just look at the number of edits to the page:
https://ru.wikinews.org/w/index.php?title=Категория:Чемпионат_мира_по_футболу_2014/Статистика&action=history

I know that stats.grok.se is an external tool, but it works on WMF's raw data, and it seems that the raw data are prepared incorrectly.
Comment 1 Andre Klapper 2014-07-02 12:43:14 UTC
So is there an indicator that the issue is with Wikimedia's data, and not with stats.grok.se processing it?
Comment 2 Oliver Keyes 2014-07-02 13:05:49 UTC
(In reply to Andre Klapper from comment #1)
> So is there an indicator that the issue is with Wikimedia's data, and not
> with stats.grok.se processing it?

Given the file format it would be very difficult for it to be incorrectly processed.

My only comments on this would be:

* The pageview stats are based on URL matching, not the actual page, so depending on how the page was /reached/, pageviews may not appear.
* Direct comparisons with edit events aren't possible because multiple edit events can be launched from a single pageview, and edit events themselves are excluded from the counter (wrong MIME type).
Comment 3 Bawolff (Brian Wolff) 2014-07-02 17:09:57 UTC
Aren't these stats based on sampling as well?
Comment 4 Oliver Keyes 2014-07-02 17:12:19 UTC
Actually I don't know; ErikZ can talk to that bit better than me. The large, aggregate breakdowns definitely are; I'm not sure if we do URL matching against those to produce the output here, or if the output here is based on the raw count.
Comment 5 Bawolff (Brian Wolff) 2014-07-02 17:28:48 UTC
(In reply to Oliver Keyes from comment #4)
> Actually I don't know; ErikZ can talk to that bit better than me. The large,
> aggregate breakdowns definitely are; I'm not sure if we do URL matching
> against those to produce the output here, or if the output here is based on
> the raw count.

Looking at some of the raw data - I see a lot of pages with 1 hit. If they were sampled I wouldn't expect that, so please ignore me :)
Comment 6 christian 2014-07-02 17:42:13 UTC
(In reply to Andre Klapper from comment #1)
> So is there an indicator that the issue is with Wikimedia's data, and not
> with stats.grok.se processing it?

stats.grok.se typically reads our files without problems.
webstatscollector (the one producing those files) is more hairy.
It has broken before, especially around non-Latin characters.

I'll check the files.

However, the way things look ... I am not sure if something is
broken. Given that we're measuring against edits and how
webstatscollector is filtering, everything might just be fine^Wwithin
expectations.

Checking it nonetheless.



(In reply to Oliver Keyes from comment #2)
> *Direct comparisons with edit events isn't possible because multiple edit
> events can be launched from a single pageview, [...]

Right. And for example bots need not do a pageview (in
webstatscollector sense). They can edit right away.



> and edit events themselves
> are excluded from the counter (wrong MIME type)

webstatscollector does not care about MIME types, and counts requests
regardless of MIME types.

However, webstatscollector cares about "/wiki/" being in the URL.
Edits are typically made through the API or directly through
/w/index.php. Neither has "/wiki/" in the URL, and hence neither gets
counted by webstatscollector.
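
The URL rule described above can be sketched roughly as follows. This is only an illustration of the "/wiki/" check mentioned in this comment, not webstatscollector's actual code; the function name and the example URLs are hypothetical, and the real filter may apply further rules.

```python
def counted_by_wiki_filter(url: str) -> bool:
    """Simplified stand-in: a request counts only if "/wiki/" is in its URL."""
    return "/wiki/" in url


# Hypothetical example URLs, for illustration only.
examples = [
    "https://ru.wikinews.org/wiki/Example_page",
    "https://ru.wikinews.org/w/index.php?title=Example_page&action=edit",
    "https://ru.wikinews.org/w/api.php?action=edit",
]
for url in examples:
    print(counted_by_wiki_filter(url), url)
```

Under this simplified rule only the first URL would be counted, which is why the number of edits and the number of counted page views need not track each other.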



(In reply to Bawolff (Brian Wolff) from comment #3)
> Aren't these stats based on sampling as well?

It's one of the few parts that is unsampled :-)
stats.grok.se is driven by

  http://dumps.wikimedia.org/other/pagecounts-raw/

which is the output of webstatscollector, which consumes the full
unsampled firehose (well ... there is some packet loss).
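
For readers who want to check a page against the raw files themselves, a minimal sketch follows. It assumes each line of an (already decompressed) pagecounts-raw file has four space-separated fields: project code, percent-encoded page title, view count, and bytes transferred. The helper names and the sample lines are made up for illustration.

```python
from urllib.parse import unquote


def parse_pagecounts_line(line):
    """Parse one line into (project, title, views, bytes), or None if malformed."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    project, title, views, size = parts
    try:
        # Titles are assumed to be percent-encoded UTF-8.
        return project, unquote(title), int(views), int(size)
    except ValueError:
        return None


def views_for(lines, project, title):
    """Sum the view counts of all lines matching the given project and title."""
    total = 0
    for line in lines:
        record = parse_pagecounts_line(line)
        if record and record[0] == project and record[1] == title:
            total += record[2]
    return total


# Made-up sample lines, not real data.
sample = ["ru.n Foo 3 12345", "en Foo 7 999", "not a valid line"]
print(views_for(sample, "ru.n", "Foo"))
```

Matching is done on the decoded title, so a title requested under a different URL spelling would still need to be normalized before comparison.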
Comment 7 christian 2014-07-03 08:25:34 UTC
While the page has some properties that would allow us to explain away
some of the effects we're seeing, it turned out that since mid-April,
SSL logs were no longer fed into webstatscollector (bug 67456), hence
SSL traffic did not get counted on stats.grok.se.

Across all projects, SSL traffic does not account for too much, but
for ru.wikinews.org SSL traffic seems more relevant.

Seeing if I can find more things.
Comment 8 christian 2014-07-06 08:33:34 UTC
I could not find further hiccups in the counting pipeline beyond bug 67456.

(In reply to christian from comment #7)
> While the page has some properties that would allow to explain away some
> effects we're seeing, it turned out that since mid-April, SSL logs were
> no longer fed into webstatscollector (bug 67456), hence SSL traffic did
> not get counted on stats.grok.se.
> 
> Across all projects, SSL traffic does not account for too much, but
> for ru.wikinews.org SSL traffic seems more relevant.
> 
---------------------------------------------

(In reply to christian from comment #7)
> Seeing if I can find more things.

I could not find more things.

Checking 7 consecutive days from the end of June and adding the SSL
page counts by hand, page views for this page went from ~1/day up to
~14/day, which is more plausible.
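
The by-hand addition amounts to summing the two per-day series and averaging. A minimal sketch of that arithmetic, with a hypothetical helper name and invented per-day numbers (the actual daily figures are not given in this report):

```python
def mean_daily_views(*daily_series):
    """Average views per day after adding several {day: views} series together."""
    days = set()
    for series in daily_series:
        days.update(series)
    if not days:
        return 0.0
    total = sum(series.get(day, 0) for series in daily_series for day in days)
    return total / len(days)


# Invented numbers, chosen only to show the arithmetic.
non_ssl = {"day1": 1, "day2": 1}
ssl = {"day1": 13, "day2": 13}
print(mean_daily_views(non_ssl))
print(mean_daily_views(non_ssl, ssl))
```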

Also bear in mind that webstatscollector counts a hit on a redirect only
for the redirect's source, not for its target. So hits for
  https://ru.wikinews.org/wiki/Чемпионат_мира_по_футболу_2014/Статистика
(which redirects to
  https://ru.wikinews.org/wiki/Категория:Чемпионат_мира_по_футболу_2014/Статистика
) are counted only at
  http://stats.grok.se/ru.n/latest30/Чемпионат_мира_по_футболу_2014/Статистика
and show considerably more page views already. But those numbers are
going to increase further, once SSL requests get fed into
webstatscollector again.
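
To get a per-page total that includes hits arriving via redirects, the counts of each redirect source would have to be added to its target. A rough sketch, assuming a redirect map is available from elsewhere; the function name, the map, and the counts below are hypothetical, and only one level of redirects is followed:

```python
def fold_redirects(views_by_title, redirects):
    """Add the views of each redirect source to the views of its target."""
    combined = {}
    for title, views in views_by_title.items():
        target = redirects.get(title, title)  # non-redirects map to themselves
        combined[target] = combined.get(target, 0) + views
    return combined


# Titles taken from this report; the counts are invented.
source = "Чемпионат_мира_по_футболу_2014/Статистика"
target = "Категория:Чемпионат_мира_по_футболу_2014/Статистика"
print(fold_redirects({source: 40, target: 6}, {source: target}))
```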
Comment 9 christian 2014-07-08 08:50:01 UTC
SSL requests are being fed into webstatscollector again.
