Last modified: 2014-04-17 14:53:10 UTC
I am trying to upload a cohort of 103 users and get a "validation failure" error. I was successful with a subset of 4 users, but not with larger subsets (I tried splitting into two and uploading each "half" and still got validation failures).
I'm having a similar problem when I test upload a cohort. I uploaded my cohort yesterday and it worked (after Dan fixed my other problem). I upload the same cohort today and get "validation: FAILURE". Is it because of the size of the cohort? Pete's is 100+ and mine is 400+.

(In reply to Pete F from comment #0)
> I am trying to upload a cohort of 103 users and get a "validation failure"
> bug. I was successful with a subset of 4 users, but not with larger subsets
> (I tried splitting into two and uploading each "half" and still got
> validation failures).
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1539
Created attachment 15084 [details]
Screenshot of error message

Same cohort uploaded yesterday and validated successfully.
Hi Tighe -- do we have the cohorts? thanks, -Toby
Created attachment 15085 [details]
Tighe's cohort CSV

Some of the usernames will be invalid (a known issue on my end), but most should be valid and should not result in "validation: FAILURE".
Thanks for the bug report, Tighe. I tried to debug for an hour or so tonight and I have some progress but no resolution. Unfortunately I won't be able to get back to this until Monday or Tuesday next week.

So, it's not the size of the cohort; wikimetrics accepts much larger cohorts than the ones you're mentioning. And the problem seems totally unrelated to yesterday's bugs. As far as I can tell, wikimetrics is choking on some non-standard characters that show up in your cohort. These shouldn't have validated yesterday either, so I'm thinking somehow this file has changed since then. If that's not true, then I'm very puzzled, because the code hasn't changed at all and all the ghost process problems are gone.

Here's what I did: I stripped every non-ASCII character from your cohort and validated that, and it worked fine; about 80+ users were found valid. That's obviously not a solution, but it suggests we're having strange character issues. I will look into this in more depth and provide a fix early next week. The really strange thing is that I expressly tested these kinds of characters and they worked, and they also work fine on my local machine.
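For anyone reproducing this: the "strip non-ASCII" workaround described above can be sketched roughly like the snippet below. Note this is a hypothetical illustration of the approach, not actual wikimetrics code, and `strip_non_ascii` is a made-up helper name.

```python
def strip_non_ascii(text):
    # Keep only 7-bit ASCII characters; accented letters are dropped
    # entirely, which mangles usernames but lets validation proceed.
    return "".join(ch for ch in text if ord(ch) < 128)

# A username containing U+00E8 (è) loses the accented letter:
print(strip_non_ascii("Andr\xe8"))  # -> Andr
```

As noted above, this is a diagnostic step, not a fix: usernames that differ only in accented characters become indistinguishable (or simply wrong) after stripping.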
This may be useful if someone else decides to debug:

milimetric@wikimetrics-staging1:/srv/wikimetrics$ sudo tail -f /var/log/upstart/wikimetrics-queue.log
    return self.run(*args, **kwargs)
  File "/srv/wikimetrics/wikimetrics/models/validate_cohort.py", line 21, in async_validate
    validate_cohort.run()
  File "/srv/wikimetrics/wikimetrics/models/validate_cohort.py", line 109, in run
    self.validate_records(session, cohort)
  File "/srv/wikimetrics/wikimetrics/models/validate_cohort.py", line 177, in validate_records
    validate_users(wikiusers, project, self.validate_as_user_ids)
  File "/srv/wikimetrics/wikimetrics/models/validate_cohort.py", line 270, in validate_users
    raise e
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 13: ordinal not in range(128)

The problem is in wikimetrics/controllers/forms/cohort_upload.py:parse_username. Basically, character set handling in Python 2.x is unfairly difficult and seemingly stops working at random. We want to switch to Python 3.x, and I think this is even more proof that we should.
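The error class in that traceback is easy to reproduce in isolation. The sketch below (Python 3 syntax; in Python 2 the same `encode('ascii')` happens implicitly whenever a unicode string is coerced to `str`) shows how any username containing a character like U+00E8 (è) trips the ASCII codec while encoding as UTF-8 succeeds. The username is invented for illustration.

```python
name = "Andr\xe8"  # hypothetical username containing U+00E8 (è)

def ascii_encode_fails(s):
    # True if the string cannot be represented in ASCII -- the same
    # condition that raises UnicodeEncodeError in the traceback above.
    try:
        s.encode("ascii")
        return False
    except UnicodeEncodeError:
        return True

print(ascii_encode_fails(name))   # -> True
print(name.encode("utf-8"))       # -> b'Andr\xc3\xa8' (UTF-8 handles it fine)
```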
Thanks Dan. In the short term I will try uploading a subset that excludes non-alphanumeric characters. I think that will allow me to learn what I need, and I can (I think?) expand my cohort in the future if and when this is resolved.
The failure is actually happening on decode rather than encode, but since we are swallowing errors there, we only see it on a later line.

File: wikimetrics/controllers/forms/cohort_upload.py
Line where the error is first present: username = username.decode('utf8', errors='ignore')

Still looking into it.
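To illustrate why `errors='ignore'` hides the real problem: if the uploaded CSV is not actually UTF-8 (for example latin-1, which spreadsheet exports commonly produce), the bad byte is silently discarded instead of raising, and the truncated username only blows up later. A minimal sketch, assuming latin-1 input as a plausible example:

```python
# 'Andrè' as a latin-1 byte string, e.g. from a spreadsheet CSV export.
raw = "Andr\xe8".encode("latin-1")  # b'Andr\xe8'

# 0xE8 is not valid UTF-8 here, so 'ignore' silently drops it:
print(raw.decode("utf8", errors="ignore"))   # -> Andr

# 'replace' at least makes the corruption visible:
print(raw.decode("utf8", errors="replace"))  # -> Andr\ufffd (Andr�)
```

So the decode succeeds with mangled output, and the failure surfaces on a later encode instead of at the true source.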
Change 125961 had a related patch set uploaded by QChris: Fix type of user_name in SQLAlchemy's model of MediaWiki's user table https://gerrit.wikimedia.org/r/125961
Change 125961 abandoned by QChris: Fix type of user_name in SQLAlchemy's model of MediaWiki's user table Reason: The Bug 63836 will get fixed by https://gerrit.wikimedia.org/r/#/c/125752/ instead. https://gerrit.wikimedia.org/r/125961
The fix allowed me to validate my cohort (attached) but it appears to have rejected all usernames in Arabic script.
Fixed by https://gerrit.wikimedia.org/r/#/c/125752/

Tighe, I know you haven't confirmed yet, so feel free to reopen if you find issues. I'm closing because two other users confirmed the fix.