Last modified: 2012-11-27 13:08:17 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links might be broken. See T4290, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 2290 - Disallow usernames that are too similar to existing names (confusables, impersonation)
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: User login and signup
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 3 votes
Target Milestone: ---
Assigned To: Neil Harris
Duplicates: 3313 3982
Depends on:
Blocks: unicode 3985
Reported: 2005-06-02 13:24 UTC by Neil Harris
Modified: 2012-11-27 13:08 UTC (History)

Attachments
Python code for filtering usernames (89.45 KB, text/plain)
2006-09-14 12:46 UTC, Neil Harris
Python code for filtering usernames (89.59 KB, text/plain)
2006-09-14 14:10 UTC, Neil Harris
Experimental language-code-to-script-code mapping (2.56 KB, text/plain)
2006-09-14 23:02 UTC, Neil Harris
Experimental language-code-to-script-code mapping; (4.03 KB, text/plain)
2006-09-15 21:55 UTC, Neil Harris
Python code for filtering usernames, v0.3 (89.96 KB, text/plain)
2006-09-18 08:00 UTC, Neil Harris
Python code for filtering usernames, v0.4 (90.14 KB, text/plain)
2006-09-18 08:42 UTC, Neil Harris
New confusables equivalence sets file, generated from UTR#39 confusables.txt (33.90 KB, text/plain)
2006-11-14 00:40 UTC, Neil Harris
Some extra confusables (UTF-8 format text file) (291 bytes, text/plain)
2006-11-14 01:06 UTC, Neil Harris
Some extra confusables, v2 (UTF-8 format text file) (325 bytes, text/plain)
2006-11-14 01:21 UTC, Neil Harris
New confusables equivalence sets file v2, generated from UTR#39 confusables.txt + extras (33.65 KB, text/plain)
2006-11-14 01:25 UTC, Neil Harris
Python code for creating equivalence sets of characters (1.33 KB, text/plain)
2006-11-14 01:32 UTC, Neil Harris

Description Neil Harris 2005-06-02 13:24:36 UTC
A more and more common form of abuse consists of vandals and trolls registering
new accounts that "look like" other users' accounts, by using characters that
look like other characters. For example, "l" may be used instead of "I", or an
acute-accented 'i' used instead of an ordinary one. These accounts can cause no
end of trouble by being used to conceal other kinds of mischief, or to get the
impersonated user into trouble. It is very difficult to tell these apart without
detailed inspection, and the software at present has no idea of visual
similarity between usernames.

Proposed solution:

Keep a homograph character table, and for each new username, canonicalize it by
applying the homograph table to it. Then compare this canonicalized version of
the name with a pre-existing list of canonicalized usernames, and block it if it
occurs in that list. In this way, registering a username will block the
registration of other "confusingly similar" usernames.
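
A minimal sketch of the scheme proposed above, in Python. The homograph table
here is hypothetical and deliberately tiny, the names canonicalize and
UsernameRegistry are made up for illustration, and stripping every accent is a
more aggressive choice than a real deployment would probably want:

```python
import unicodedata

# Hypothetical, deliberately tiny homograph table: each character maps to the
# representative of its look-alike group. A real table would be far larger.
HOMOGRAPHS = {
    "l": "I",  # lowercase L looks like capital I
    "1": "I",  # digit one
    "0": "O",  # digit zero looks like capital O
}


def canonicalize(name):
    """Strip combining accents, then map each character to its representative."""
    decomposed = unicodedata.normalize("NFD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(HOMOGRAPHS.get(ch, ch) for ch in without_marks)


class UsernameRegistry:
    """Keeps the canonical form of every registered name (illustrative only)."""

    def __init__(self):
        self._taken = {}  # canonical form -> first name that claimed it

    def register(self, name):
        """Return (True, name) on success, or (False, conflicting_name)."""
        key = canonicalize(name)
        holder = self._taken.get(key)
        if holder is not None:
            return False, holder
        self._taken[key] = name
        return True, name


registry = UsernameRegistry()
print(registry.register("Illinois"))        # first claim succeeds
print(registry.register("lllinois"))        # lowercase L instead of capital I
print(registry.register("Ill\u00ednois"))   # acute-accented i
```

Only the canonical forms need to be indexed, so the check at registration time
is one canonicalization plus one lookup.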

The good news is that the heavy lifting for this work has already been
performed as part of trying to close the same spoofing hole for
internationalized domain names, and homograph lists have already been compiled
as part of this work. E-mail me if you want me to dig out the lists; I don't
have links to them to hand on this machine.
Comment 1 Neil Harris 2005-06-02 13:29:44 UTC
See the references towards the end of http://unicode.org/reports/tr36/ for a
very simple example of a confusables data file; but I know that much more
complete ones have been compiled elsewhere...
Comment 2 Neil Harris 2005-06-04 00:43:37 UTC
Here is the URL for the very nicely compiled multilingual confusables file, in
what I hope is a sufficiently self-documenting format:

http://unicode.org/reports/tr36/draft/confusables.txt

Presumably the "official" TR36 file, and any updates, will also be in a similar
format.
Comment 3 Zhen Lin 2005-06-06 01:38:00 UTC
During a vandal attack on a MediaWiki installation I run, the vandal used
Cyrillic lookalikes to impersonate an administrator. No amount of visual
scrutiny would have revealed anything, since typically Cyrillic glyphs are
copied from the Latin lookalikes. Fortunately this is also covered in the
confusables table.
Comment 4 Zigger 2005-09-06 13:30:59 UTC
*** Bug 3313 has been marked as a duplicate of this bug. ***
Comment 5 Filip Maljkovic [Dungodung] 2005-09-06 16:07:11 UTC
I just want to add that many Cyrillic letters look the same as letters in the
Latin script, so confusion is possible. The letters are "A B C E H J K M O P T
X a c e j o p x" as opposed to "%D0%90 %D0%92 %D0%A1 %D0%95 %D0%9D %D0%88 %D0%9A
%D0%9C %D0%9E %D0%A0 %D0%A2 %D0%A5 %D0%B0 %D1%81 %D0%B5 %D1%98 %D0%BE %D1%80
%D1%85" (percent-encoded UTF-8, as shown in the nav-bar). They are all the same,
except for one pair, which is extremely similar.
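
For readers who want to see which characters those percent-encoded sequences
stand for, here is a small sketch using only the Python standard library. The
helper name describe is made up for illustration:

```python
import unicodedata
from urllib.parse import unquote


def describe(percent_encoded):
    """Decode space-separated percent-encoded UTF-8 and name each character."""
    rows = []
    for token in percent_encoded.split():
        char = unquote(token)  # unquote() decodes as UTF-8 by default
        rows.append((char, f"U+{ord(char):04X}", unicodedata.name(char)))
    return rows


for row in describe("%D0%90 %D1%81 %D1%98"):
    print(row)
```

Each token in the list above decodes to a single Cyrillic letter, so ord() on
the decoded token is safe here; longer sequences would need a loop over the
decoded string.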
Comment 6 Ævar Arnfjörð Bjarmason 2005-10-07 20:55:47 UTC

*** This bug has been marked as a duplicate of 1524 ***
Comment 7 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-04 15:12:20 UTC
People are/were discussing this at bug 1524, but this remains a separate issue.
It took me forever to find this by searching, since it was closed.
Comment 8 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-04 15:12:47 UTC
*** Bug 3982 has been marked as a duplicate of this bug. ***
Comment 9 Neil Harris 2006-09-14 12:46:21 UTC
Created attachment 2347
Python code for filtering usernames

Here's some Python code to canonicalize user names to reject most spoofing
attacks. The program also returns an error status if the username is malformed,
for example by containing non-script characters, or mixing two incompatible
scripts.
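
A rough sketch of what such a well-formedness check could look like. This is
not the attached code: Python's standard library has no Unicode script
property, so the first word of each character name is used as a crude stand-in,
the choice of which characters count as "non-script" is a guess, and any two
different scripts are treated as incompatible. The names script_of and
check_username are hypothetical:

```python
import unicodedata


def script_of(ch):
    """Crude script guess: the first word of the Unicode character name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name
        return "UNKNOWN"


def check_username(name):
    """Return (ok, reason). Assumption: controls and symbols are malformed."""
    scripts = set()
    for ch in name:
        if ch.isalpha():
            scripts.add(script_of(ch))
        elif unicodedata.category(ch)[0] in "CS":
            # Category C = control/format/unassigned, S = symbols.
            return False, f"non-script character U+{ord(ch):04X}"
    if len(scripts) > 1:
        return False, "mixed scripts: " + ", ".join(sorted(scripts))
    return True, "ok"


print(check_username("Admin"))             # all Latin
print(check_username("\u0410dmin"))        # Cyrillic A followed by Latin
print(check_username("bad\u200ename"))     # contains a LEFT-TO-RIGHT MARK
```

A real filter would need a proper script table and a list of script pairs that
legitimately occur together, for example in Japanese names.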

The general idea is to keep a canonicalized version of each username in another
table, and, when registering a new username, look up the canonicalized username
to see if it is already registered. If it is, the user should be told that
their username is too similar to an existing username, and prompted to try
again.

For example: 

"SOME USERNAME" canonicalizes to v1:50MEU5EMAME (the v1: is a version tag, in
case the canonicalization code ever changes). The same canonical string will be
generated for "some username", "SOME USERNAME!!!!!", "S0ME U5ERNAME", and so
on... 
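
To make the example concrete, here is a toy canonicalizer that produces the
same string for all four spellings. The mapping table is hypothetical, chosen
only to reproduce the example above ("S" to "5", "O" to "0", and the pair "RN"
to "M"); it is not the table from the attachment, and NFKC is an assumed choice
of normalization:

```python
import unicodedata

VERSION_TAG = "v1:"

# Hypothetical tables, inferred from the example string only.
SINGLE_CHAR = {"S": "5", "O": "0"}
MULTI_CHAR = [("RN", "M")]  # "rn" set close together reads like "m"


def canonicalize(name):
    """Toy canonicalizer: normalize, uppercase, drop non-alphanumerics, map."""
    text = unicodedata.normalize("NFKC", name).upper()
    text = "".join(ch for ch in text if ch.isalnum())
    text = "".join(SINGLE_CHAR.get(ch, ch) for ch in text)
    for sequence, replacement in MULTI_CHAR:
        text = text.replace(sequence, replacement)
    return VERSION_TAG + text


for candidate in ("SOME USERNAME", "some username",
                  "SOME USERNAME!!!!!", "S0ME U5ERNAME"):
    print(candidate, "->", canonicalize(candidate))
```

Keeping the version tag inside the stored string means that canonical forms
produced by an older table can be recognized and regenerated after the table
changes.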

I can easily add other filters, so that, for example, "Some Username5"
canonicalizes to the same string as "Some Username 4", and "Bad, bad user"
would canonicalize to the same string as "Bad, bad, bad user".

This version of the code is a bit aggressive, as it assumes that labels can be
in any one script, so E, H, and N are currently considered equivalent because
of the need for transitivity between different cases of different scripts: if
usernames can be restricted to a small subset of possible scripts, some of the
more aggressive canonicalization can be relaxed, and E, H, and N can again be
distinguished.

Preliminary testing shows that this code appears to have a false-positive rate
of under 1% on random plausible names, which is probably acceptable.
Comment 10 Neil Harris 2006-09-14 12:47:59 UTC
Oh, and I should mention, just in case you're not reading the code, that it
works on a vast number of scripts.
Comment 11 Neil Harris 2006-09-14 14:10:43 UTC
Created attachment 2348
Python code for filtering usernames

Murphy's law in action: the example I gave in the attachment comment is an edge
case that didn't get tested properly: now fixed.
Comment 12 Neil Harris 2006-09-14 23:02:24 UTC
Created attachment 2354
Experimental language-code-to-script-code mapping

This file attempts to map languages to sets of possible scripts. Where a
language can be written in more than one script, all known script codes are
included.
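
As an illustration of the kind of mapping meant here (the attachment's actual
file format is not reproduced), a language-to-script table can be sketched as a
dictionary from language codes to sets of ISO 15924 script codes. The entries
and the helper names scripts_for and script_allowed are illustrative only:

```python
# Illustrative entries only; script codes follow ISO 15924.
LANGUAGE_SCRIPTS = {
    "en": {"Latn"},
    "ru": {"Cyrl"},
    "el": {"Grek"},
    "sr": {"Cyrl", "Latn"},          # Serbian is written in both scripts
    "ja": {"Hani", "Hira", "Kana"},  # kanji, hiragana, katakana
}


def scripts_for(language_code):
    """Return the set of scripts for a language, or an empty set if unknown."""
    return LANGUAGE_SCRIPTS.get(language_code, frozenset())


def script_allowed(language_code, script_code):
    """True if the script is one the language is known to be written in."""
    return script_code in scripts_for(language_code)


print(script_allowed("sr", "Latn"))
print(script_allowed("en", "Cyrl"))
```

With such a table, a wiki in a given language could restrict usernames to that
language's scripts, which is what would allow the more aggressive
canonicalization to be relaxed.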

Where an example character does not have a script code, it is output as U+XXXX.
Comment 13 Neil Harris 2006-09-15 21:55:47 UTC
Created attachment 2363
Experimental language-code-to-script-code mapping;

Now with 79 more script repertoires, based on analyzing the wikipedia.org front
page
Comment 14 Neil Harris 2006-09-18 08:00:11 UTC
Created attachment 2369
Python code for filtering usernames, v0.3

Now uses stdin/stdout for input and output, thus allowing for batch conversion
and freeing the command line up for later addition of option flags.
Comment 15 Neil Harris 2006-09-18 08:42:27 UTC
Created attachment 2370
Python code for filtering usernames, v0.4

Now with exception handling, just in case of nasty attacks (e.g. BiDi
violations) intended to blow up the low-level Unicode-processing code.
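
The general shape of such a filter, reading names from one stream and writing
results to another while treating any exception as a rejection, might look
like the sketch below. The canonicalize function is a trivial stand-in, not the
attached code, and the OK/ERROR output format is made up for illustration:

```python
import io
import unicodedata


def canonicalize(name):
    """Stand-in for the real filter: NFKC-normalize and case-fold."""
    if not name:
        raise ValueError("empty name")
    return unicodedata.normalize("NFKC", name).casefold()


def run_filter(source, sink):
    """Read one name per line; write 'OK<TAB>canonical' or 'ERROR<TAB>type'."""
    for line in source:
        name = line.rstrip("\n")
        try:
            sink.write("OK\t" + canonicalize(name) + "\n")
        except Exception as exc:
            # Any failure rejects that one name instead of aborting the batch.
            sink.write("ERROR\t" + type(exc).__name__ + "\n")


# In-memory streams stand in for stdin/stdout in this demonstration.
demo_in = io.StringIO("Some Name\n\n\uff21dmin\n")
demo_out = io.StringIO()
run_filter(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Passing sys.stdin and sys.stdout to run_filter would give the batch behaviour
described for v0.3.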
Comment 16 Brion Vibber 2006-09-19 11:01:55 UTC
I've translated Neil's code to PHP, committed in r16555.

Can build an extension around that to check on account creation.

Currently there are some lazy and inefficient bits; it runs about 30% slower than the Python 
version on the set of usernames from meta.wikimedia.org, but that's plenty fast for the individual 
checking, a smidge under a millisecond per name on a 2 GHz G5. (Live check will just be a single 
name munging and a DB lookup.)
Comment 17 Neil Harris 2006-11-13 01:11:51 UTC
There are false positive problems with the existing code which need a more
careful second pass to check strings which match the initial checks. Code to
follow...
Comment 18 Neil Harris 2006-11-14 00:40:56 UTC
Created attachment 2699
New confusables equivalence sets file, generated from UTR#39 confusables.txt

Note: this file is encoded in UTF-8, and contains exotic characters, many of
which may display as spaces or not at all: beware!

This is a transitive closure of the single-character to single-character
mappings within UTR #39's confusables.txt file. Remember to normalize strings
before applying these mappings...
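
For anyone unfamiliar with the idea, a transitive closure over confusable pairs
can be sketched with a small union-find structure. The pairs below are a
hand-picked illustration rather than data from confusables.txt, and NFKC is an
assumption, since the normalization form is not specified here:

```python
import unicodedata


def build_equivalence_map(pairs):
    """Map every character to one representative of its confusable group."""
    parent = {}

    def find(ch):
        parent.setdefault(ch, ch)
        while parent[ch] != ch:
            parent[ch] = parent[parent[ch]]  # path halving
            ch = parent[ch]
        return ch

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lower code point as representative, for determinism.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    return {ch: find(ch) for ch in list(parent)}


def apply_mapping(name, mapping):
    """Normalize first (NFKC assumed), then map each character."""
    normalized = unicodedata.normalize("NFKC", name)
    return "".join(mapping.get(ch, ch) for ch in normalized)


# Illustrative pairs: Latin H / Cyrillic En / Greek Eta, and O / 0 / Cyrillic O.
PAIRS = [("H", "\u041d"), ("\u041d", "\u0397"), ("O", "0"), ("O", "\u041e")]
MAPPING = build_equivalence_map(PAIRS)
print(apply_mapping("\u041d\u041e\u0397", MAPPING))
```

Latin H and Greek Eta are never paired directly in this input; they end up in
the same set only because both are paired with Cyrillic En, which is exactly
what the closure does.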
Comment 19 Neil Harris 2006-11-14 01:06:42 UTC
Created attachment 2700
Some extra confusables (UTF-8 format text file)

Some extra confusables that are not in UTR#39, spotted by eye.
Comment 20 Neil Harris 2006-11-14 01:21:59 UTC
Created attachment 2702
Some extra confusables, v2 (UTF-8 format text file)

A second version of the above...
Comment 21 Neil Harris 2006-11-14 01:25:14 UTC
Created attachment 2703
New confusables equivalence sets file v2, generated from UTR#39 confusables.txt + extras

Note: this file is encoded in UTF-8, and contains exotic characters, many of
which may display as spaces or not at all: beware!

This is a transitive closure of the single-character to single-character
mappings within UTR #39's confusables.txt file, combined with my
extra_confusables.txt file. Remember to normalize strings before applying
these mappings...
Comment 22 Neil Harris 2006-11-14 01:29:39 UTC
Note: some letterforms are confusable with more than one other letterform, but
these other letterforms are not confusable with each other. This should be taken
into account in later, more sophisticated, versions of this code.
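
One possible way to take this into account, sketched under assumptions: instead
of collapsing characters into equivalence classes, compare two names position
by position against a table of directly confusable pairs. The table below is
hypothetical; it supposes that "l" is confusable with both "I" and "1" but that
"I" and "1" are judged distinguishable from each other:

```python
# Hypothetical table of directly confusable pairs (unordered).
CONFUSABLE_PAIRS = {
    frozenset("lI"),  # lowercase L vs capital I
    frozenset("l1"),  # lowercase L vs digit one
}


def chars_confusable(a, b):
    """True if the characters are identical or listed as a direct pair."""
    return a == b or frozenset((a, b)) in CONFUSABLE_PAIRS


def names_confusable(first, second):
    """True if the names match position by position under the pair table."""
    return len(first) == len(second) and all(
        chars_confusable(a, b) for a, b in zip(first, second)
    )


print(names_confusable("Bill", "BiII"))   # l vs I: direct pair
print(names_confusable("Bill", "Bi11"))   # l vs 1: direct pair
print(names_confusable("BiII", "Bi11"))   # I vs 1: not a direct pair
```

The cost is that a single canonical string per name no longer suffices for
lookup, so a check like this fits best as a second pass over candidates that
already matched a coarser canonical form.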
Comment 23 Neil Harris 2006-11-14 01:32:57 UTC
Created attachment 2704
Python code for creating equivalence sets of characters
Comment 24 Rob Church 2006-11-29 15:34:47 UTC
This was poked, prodded, converted and ported into the AntiSpoof extension,
available in Subversion.
Comment 25 Larry Pieniazek 2006-12-13 17:35:22 UTC
Not sure if this comment belongs against this bug, but I have userid "Lar" on
many WMF wikis. I recently started having trouble registering this userid on new
wikis because of a conflict with user "Iar". Based on discussion on #mediawiki,
it was suggested that this is because the software sees uppercase I and
lowercase L as similar, and that's tripping me up. I'm not sure how best to get
around that, but it's a nuisance to have to contact each wiki admin separately.
See Neil Harris's comment of 11-14 01:29, which perhaps alludes to this.
Presumably once WMF wikis have SUL this goes away?
Comment 26 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-12-14 03:33:54 UTC
It would no longer be a problem for existing users, but it would still be a
problem for people signing up for a WMF account for the first time, so it's
still undesirable.

See bug 8257.
Comment 27 Anu 2012-11-27 10:27:56 UTC
I have a question: in different languages, can the same characters be identified under different names? Does the Python code take care of this?

Can this thread be closed?
Comment 28 Andre Klapper 2012-11-27 13:08:17 UTC
Anu: This report/"thread" was already closed as RESOLVED FIXED six years ago, and MediaWiki does not use Python code here.
Please refrain from commenting on this ticket - thanks. :)
