Last modified: 2013-03-20 00:54:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T47178, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 45178 - Space characters in [pagecounts-raw] titles
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Diederik van Liere
Reported: 2013-02-19 23:26 UTC by Andrew G. West
Modified: 2013-03-20 00:54 UTC (History)

Description Andrew G. West 2013-02-19 23:26:30 UTC
Beginning Feb. 1, space characters began appearing in *some* article titles in the pagecount data at http://dumps.wikimedia.org/other/pagecounts-raw/.

This file is space-delimited, so this is breaking some parsing schemes. I understand that some of the internal logs were changing to a tab-delimited format, but this was not supposed to affect the pagecount stuff:

http://lists.wikimedia.org/pipermail/wikitech-l/2013-January/066007.html

http://en.wikipedia.org/wiki/Wikipedia:VPT#Format_Change_of_Page_View_Stats
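
For consumers hit by this in the meantime, one possible workaround is to stop splitting each line on every space. The sketch below assumes the usual four-field layout (project code, page title, view count, bytes transferred) and takes the first field from the left and the last two from the right, so whatever is left in the middle is treated as the title, spaces included. The function name and the sample lines are made up for illustration.

```python
def parse_pagecount_line(line):
    """Parse one 'project title count bytes' line, tolerating spaces in the title.

    Assumes four logical fields; this layout is an assumption, not taken
    from the bug report itself.
    """
    line = line.rstrip("\n")
    # The project code is everything before the first space.
    project, rest = line.split(" ", 1)
    # The view count and byte count are the last two space-separated fields;
    # anything between them and the project code is the title.
    title, count, size = rest.rsplit(" ", 2)
    return project, title, int(count), int(size)


if __name__ == "__main__":
    # Hypothetical sample lines, one normal and one with a space in the title.
    print(parse_pagecount_line("en Main_Page 42 1000"))
    print(parse_pagecount_line("en Albert Einstein 3 12345"))
```

A line with fewer than four fields raises ValueError, which is probably what a parser wants here rather than silently misreading the counts.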
Comment 1 Diederik van Liere 2013-02-20 00:00:18 UTC
Hey Andrew,

Thanks for reaching out! Yes, you are right: a couple of thousand titles have spaces in them, and this indeed happened after the tab introduction, but in an unexpected way.

Prior to the tab introduction, the title of the page would be truncated (because we used space as a delimiter) and so incorrect / incomplete titles would show up in the dumps data. Now, with the introduction of the tab, we have really surfaced this bug.

The space is introduced because, under very rare conditions, the Nginx server does not encode the space as %20; so far I have only seen this happening if the request comes from Googlebot and the server response is 301 (Moved Permanently).

We tried to replicate the conditions so we could fix our Nginx server configuration but we have not yet been able to do so. We could add a function in webstatscollector (the software that generates the data) to replace those spaces with %20 but I am worried that this will introduce performance regressions. 

My plan is:
1) We will test webstatscollector with a replace function, if this all works, great! problem solved.
2) If the replace function introduces a performance regression then I will mark this bug as WONTFIX. 
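
To make the proposed replace function concrete, here is a minimal sketch of the idea in Python. This is only an illustration: webstatscollector itself is not written in Python, the function name is hypothetical, and this is not the code from any actual commit.

```python
def encode_title_spaces(title):
    """Replace literal spaces in a page title with their percent-encoded form.

    Hypothetical helper for illustration only; titles without spaces are
    returned unchanged.
    """
    return title.replace(" ", "%20")


if __name__ == "__main__":
    # Made-up example title containing an unencoded space.
    print(encode_title_spaces("Albert Einstein"))
```

Since the replacement would run once per logged request, the performance concern is about the per-line cost at full traffic volume, not about the complexity of the function itself.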

Rest assured, it affects only a really really small set of articles and those views are not real views in the first place as they come from Googlebot.
Comment 2 Diederik van Liere 2013-03-08 21:47:15 UTC
You can track progress of this bug at https://mingle.corp.wikimedia.org/projects/analytics/cards/129
Comment 3 Diederik van Liere 2013-03-20 00:54:09 UTC
Commit https://gerrit.wikimedia.org/r/#/c/51680/ fixes this.
