Last modified: 2014-05-05 13:28:14 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T65933, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 63933 - Cohort Validation is not parsing correctly utf-8 usernames
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: Wikimetrics
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: nuria
Depends on:
Blocks:
Reported: 2014-04-15 11:56 UTC by nuria
Modified: 2014-05-05 13:28 UTC (History)
CC List: 6 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description nuria 2014-04-15 11:56:27 UTC
The parse_user function capitalizes the first character when formatting strings into the MediaWiki user name format, but it does so assuming one byte per character. This breaks for user names whose first character takes up two or more bytes in UTF-8.

Sample:
Current code:
>>> a = "èMarianne.ramsès  ".decode('utf-8')
>>> s = a.strip().encode('utf-8')
>>> first = s[0]
>>> print first
� -> this is only the first byte of 'è', i.e. 'half' a character


Correct sequence:
>>> a = "èMarianne.ramsès  ".decode('utf-8')
>>> s = a.strip()
>>> first = s[0].upper().encode('utf-8')
>>> print first
È
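
The samples above are Python 2. On Python 3, where str is already Unicode text, the same idea can be sketched as follows. The helper name capitalize_first is hypothetical and is not the actual parse_user code; it only illustrates decoding before touching the first character.

```python
def capitalize_first(raw_bytes, encoding="utf-8"):
    """Decode first, then upper-case the first character (hypothetical helper)."""
    text = raw_bytes.decode(encoding).strip()
    if not text:
        return text
    # Indexing a decoded string yields a whole character, not a single byte.
    return text[0].upper() + text[1:]


# "\u00e8" is 'è'; escapes are used so the example does not depend on file encoding.
raw = "\u00e8Marianne.rams\u00e8s  ".encode("utf-8")

# Slicing the encoded bytes cuts the two-byte 'è' in half (the bug described above).
half = raw[:1]

# Working on decoded text capitalizes the whole character.
fixed = capitalize_first(raw)
```

Here half holds a single byte that is not valid UTF-8 on its own, while fixed starts with 'È'.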


We likely need to review all the code that does string comparisons on user names. A dedicated type for user names that encapsulates the encoding handling may be the best option.
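
One possible reading of the "own type" suggestion is a small wrapper that decodes once at the boundary and compares only Unicode text. This is a minimal Python 3 sketch; the class name UserName and its methods are hypothetical and are not taken from the Wikimetrics code base.

```python
class UserName:
    """Hypothetical wrapper that keeps a user name as Unicode text internally."""

    def __init__(self, value, encoding="utf-8"):
        # Accept bytes or str, but decode exactly once at the boundary.
        if isinstance(value, bytes):
            value = value.decode(encoding)
        value = value.strip()
        # Capitalize the first character, not the first byte.
        self.text = value[:1].upper() + value[1:]

    def __eq__(self, other):
        return isinstance(other, UserName) and self.text == other.text

    def __hash__(self):
        return hash(self.text)

    def encode(self, encoding="utf-8"):
        # Encode only when handing the name to a byte-oriented layer.
        return self.text.encode(encoding)


# The same name arriving as raw bytes and as already-capitalized text.
from_bytes = UserName("\u00e8Marianne.rams\u00e8s  ".encode("utf-8"))
from_text = UserName("\u00c8Marianne.rams\u00e8s")
same = from_bytes == from_text
```

With comparisons defined on the decoded text, callers do not need to know whether a name originally arrived as bytes or as text.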
Comment 1 Bingle 2014-04-15 12:00:23 UTC
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1548
Comment 2 Toby Negrin 2014-04-17 14:46:57 UTC
This bug has been fixed but requires an integration test to close.
Comment 3 Gerrit Notification Bot 2014-04-25 11:30:07 UTC
Change 129672 had a related patch set uploaded by Nuria:
Adding test for cohort uploading for cohort with cyrilic and arabic usernames.

https://gerrit.wikimedia.org/r/129672
Comment 4 Gerrit Notification Bot 2014-04-28 15:25:32 UTC
Change 129672 merged by Milimetric:
Adding test for cohort uploading for cohort with cyrilic and arabic usernames.

https://gerrit.wikimedia.org/r/129672
Comment 5 Dan Andreescu 2014-04-28 18:16:53 UTC
This is deployed and tested in staging.  Please re-open if you have issues in staging.  We will deploy to production on Thursday May 1st.
Comment 6 Sage Ross 2014-05-02 00:13:27 UTC
Did this get deployed already?

I've been testing today and found that utf8 names work fine when uploaded as a txt file, but if I try to use them in the Paste Usernames box, I get "error! Server error while processing your upload".
Comment 7 nuria 2014-05-04 15:42:04 UTC
Sage: Would you mind opening a bug with some examples that we can use to test the issue, noting that it only happens when copying usernames into the textbox? We have been working on encoding, but there is likely more work to do on the HTTP layer regarding character parsing.
Comment 8 Sage Ross 2014-05-05 13:28:14 UTC
nuria: done as bug 64893.
