Last modified: 2014-03-30 18:53:03 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links might be broken. See T65242, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 63242 - ccnorm revamp: add a more sensible interface for normalised comparison


Summary:	ccnorm revamp: add a more sensible interface for normalised comparison

Status:	NEW

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	AntiSpoof
Version:	master
Hardware:	All All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	i18n

Depends on:	63217
Blocks:

Reported:	2014-03-28 22:22 UTC by Nemo
Modified:	2014-03-30 18:53 UTC
CC List:	5 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Nemo 2014-03-28 22:22:44 UTC

As discussed on bug 27987, the current practice of running ccnorm on some input and then comparing the result to the alleged canonical form of a string is not viable.

The first problem is that users are often not comparing normalised strings to normalised strings; apples-and-oranges comparisons have unpredictable results. See bug 27987 comment 22 and bug 27987 comment 24.

Tim proposed something like:

(Tim Starling from bug 27987 comment 20)
> Well, how about
> 
> added_lines cclike "testing|vandalizing"
> 
> Where the regex would be tokenized and reassembled, with alphabetic parts
> normalised with equivset?

That's OK, but I think a more sensible syntax would be something like:

cclike(added_lines, testing) || cclike(added_lines, vandalizing)

That is, a single function should take two strings and tell us whether, once canonicalised in whatever manner the code wants, they are the same thing, i.e. whether they are confusable.

This is nothing special: it's the approach followed by the standard API to ICU data; see uspoof_areConfusable in <https://ssl.icu-project.org/apiref/icu4c/uspoof_8h.html#ac96fdf642bfd9efcd0d9956bd76cadaa>, found from the documents mentioned in bug 63217. Nikerabbit pointed me to UTS #36 and UTS #39, which were just drafts when AntiSpoof was created. Now we have better tools.

I'm marking this as blocked on bug 63217 because such a function seems trivial to implement with the ICU API. I'll comment there in more general terms.
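
To make the proposed two-argument interface concrete, here is a minimal sketch in Python. It is only an illustration, not AntiSpoof or AbuseFilter code: the EQUIV table is a hypothetical, deliberately tiny stand-in for the real equivset data, and skeleton() is an assumed helper name. The point it shows is that cclike() canonicalises both arguments itself, so the caller can never end up comparing a normalised string to a raw one.

```python
import unicodedata

# Hypothetical stand-in for the equivset data: a handful of
# look-alike characters mapped to one representative each.
EQUIV = {"0": "O", "1": "I", "3": "E", "5": "S", "@": "A", "$": "S"}


def skeleton(text: str) -> str:
    """Canonicalise a string: NFKC, upper-case, then map look-alikes."""
    folded = unicodedata.normalize("NFKC", text).upper()
    return "".join(EQUIV.get(ch, ch) for ch in folded)


def cclike(left: str, right: str) -> bool:
    """Return True if both strings canonicalise to the same form."""
    return skeleton(left) == skeleton(right)


if __name__ == "__main__":
    # Both sides are normalised inside the function.
    print(cclike("t3st1ng", "testing"))
    print(cclike("testing", "vandalizing"))
    # The apples-and-oranges comparison: only one side is normalised.
    print(skeleton("t3st1ng") == "testing")
```

In this sketch the last comparison fails even though the strings are confusable, because only the left side went through the canonicalisation; that is the failure mode described above. With the ICU API available, the body of cclike() would presumably defer to uspoof_areConfusable instead of a hand-rolled table.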
