Last modified: 2014-05-05 13:28:14 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T65933, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 63933 - Cohort Validation is not parsing correctly utf-8 usernames


Summary:	Cohort Validation is not parsing correctly utf-8 usernames

Status:	RESOLVED FIXED

Product:	Analytics
Classification:	Unclassified
Component:	Wikimetrics (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	High normal
Target Milestone:	---
Assigned To:	nuria

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-04-15 11:56 UTC by nuria
Modified:	2014-05-05 13:28 UTC (History)
CC List:	6 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description nuria 2014-04-15 11:56:27 UTC

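The parse_user function capitalizes the first letter of a user name to put it in the MediaWiki user name format, but it does so on the UTF-8 encoded string and assumes 1 byte per character. This breaks for user names whose first character takes up two bytes, because only the first byte of that character is picked up.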

Sample:
Current code:
>>> a = "èMarianne.ramsès  ".decode('utf-8')
>>> s = a.strip()
>>> s = a.strip().encode('utf-8')
>>> first = s[0]
>>> print first
� -> this is 'half' a character


Correct sequence:
>>> a = "èMarianne.ramsès  ".decode('utf-8')
>>> s = a.strip()
>>> first = s[0].upper().encode('utf-8')
>>> print first
È
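
To make the difference between the two sequences concrete, here is a minimal sketch that compares character and byte counts for the same sample name and checks that the first byte alone is not valid UTF-8. It is written to run on Python 3 as well as Python 2, which is an assumption on my side (the session above is Python 2), and the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Illustrative only: same sample name as above, written with escapes
# so the source file encoding does not matter.
text = u"\u00e8Marianne.rams\u00e8s"   # "èMarianne.ramsès"
encoded = text.encode('utf-8')

char_count = len(text)      # counts characters
byte_count = len(encoded)   # counts bytes; each 'è' takes two bytes

# Taking the first byte of the encoded form cuts the first character in half.
first_byte = encoded[:1]
try:
    first_byte.decode('utf-8')
    first_byte_is_valid = True
except UnicodeDecodeError:
    first_byte_is_valid = False

print(char_count, byte_count, first_byte_is_valid)
```

The two counts differ by exactly the number of two-byte characters in the name, which is why indexing has to happen on the decoded string.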


We likely need to review all the code that does string comparisons on user names. Perhaps having our own type for user names that encapsulates the encoding handling is best.
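
As a rough idea of what such a type could look like, here is a minimal sketch. The class name UserName, its attributes, and its methods are all hypothetical and are not taken from the Wikimetrics code; it assumes input arrives either as UTF-8 bytes or as already decoded text, and it only handles trimming and first-letter capitalization.

```python
# -*- coding: utf-8 -*-
class UserName(object):
    """Hypothetical wrapper for user names; not part of Wikimetrics."""

    def __init__(self, raw):
        # Decode first, so every later operation works on characters.
        if isinstance(raw, bytes):
            raw = raw.decode('utf-8')
        name = raw.strip()
        # Upper-case only the first character; slicing is safe on ''.
        self.text = name[:1].upper() + name[1:]

    def __eq__(self, other):
        return isinstance(other, UserName) and self.text == other.text

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        return hash(self.text)

    def to_bytes(self):
        # Encode only at the boundary (database, HTTP response, file).
        return self.text.encode('utf-8')


# The same name given as raw bytes and as normalized text compares equal.
from_bytes = UserName(u"\u00e8Marianne.rams\u00e8s  ".encode('utf-8'))
from_text = UserName(u"\u00c8Marianne.rams\u00e8s")
print(from_bytes == from_text)
```

Keeping the decoded text inside the object and encoding only in to_bytes means comparisons never mix byte strings and text, which is the class of problem this bug is about.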

Comment 1 Bingle 2014-04-15 12:00:23 UTC

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1548

Comment 2 Toby Negrin 2014-04-17 14:46:57 UTC

This bug has been fixed but requires an integration test to close.

Comment 3 Gerrit Notification Bot 2014-04-25 11:30:07 UTC

Change 129672 had a related patch set uploaded by Nuria:
Adding test for cohort uploading for cohort with cyrilic and arabic usernames.

https://gerrit.wikimedia.org/r/129672

Comment 4 Gerrit Notification Bot 2014-04-28 15:25:32 UTC

Change 129672 merged by Milimetric:
Adding test for cohort uploading for cohort with cyrilic and arabic usernames.

https://gerrit.wikimedia.org/r/129672

Comment 5 Dan Andreescu 2014-04-28 18:16:53 UTC

This is deployed and tested in staging.  Please re-open if you have issues in staging.  We will deploy to production on Thursday May 1st.

Comment 6 Sage Ross 2014-05-02 00:13:27 UTC

Did this get deployed already?

I've been testing today and found that utf8 names work fine when uploaded as a txt file, but if I try to use them in the Paste Usernames box, I get "error! Server error while processing your upload".

Comment 7 nuria 2014-05-04 15:42:04 UTC

Sage: Would you mind opening a bug with some examples that we can use to test the issue, noting that it only happens when copying usernames into the textbox? We have been working on encoding, but there is likely more work to do on the HTTP layer regarding character parsing.

Comment 8 Sage Ross 2014-05-05 13:28:14 UTC

nuria: done as bug 64893.
