Last modified: 2014-02-12 23:37:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from links that display bug reports and their history, links might be broken. See T41501, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 39501 - Merging Unicode similar-looking characters in internal search (apostrophes, "x" and "×", etc)
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Normal enhancement
Target Milestone: Future release
Assigned To: Nobody - You can work on this!
Keywords: utf8
Duplicates: 47881
Depends on:
Blocks:
Reported: 2012-08-20 13:07 UTC by Denis Jacquerye
Modified: 2014-02-12 23:37 UTC (History)

Description Denis Jacquerye 2012-08-20 13:07:25 UTC
When searching with the apostrophe character U+0027 (the "apostrophe/single quote" available on most keyboards), results should also match other Unicode apostrophe-like characters, such as the preferred apostrophe U+2019.

In 2009 there was a discussion about "Different apostrophe signs and MediaWiki internal search"; see
http://www.gossamer-threads.com/lists/wiki/wikitech/169177
This doesn't seem to have been implemented.

This is related to bug 36313 for autocompletion.

Basically, indexing should convert all apostrophes to U+0027, and searching should convert all apostrophes to U+0027. So articles containing U+2019, for example, would be matched when searching with U+0027, U+2019 or other apostrophes.

From the 2009 discussion, the list of apostrophes was:
U+0027 APOSTROPHE 
U+2018 LEFT SINGLE QUOTATION MARK 
U+2019 RIGHT SINGLE QUOTATION MARK 
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK 
U+2032 PRIME 
U+00B4 ACUTE ACCENT 
U+0060 GRAVE ACCENT 
U+FF40 FULLWIDTH GRAVE ACCENT 
U+FF07 FULLWIDTH APOSTROPHE

I would add other characters for which U+0027 is often used as an accessible substitute like some modifier letters and saltillo:
U+02B9 MODIFIER LETTER PRIME
U+02BB MODIFIER LETTER TURNED COMMA
U+02BC MODIFIER LETTER APOSTROPHE
U+02BD MODIFIER LETTER REVERSED COMMA
U+02BE MODIFIER LETTER RIGHT HALF RING
U+02BF MODIFIER LETTER LEFT HALF RING
U+0384 GREEK TONOS
U+1FBF GREEK PSILI
U+A78B LATIN CAPITAL LETTER SALTILLO
U+A78C LATIN SMALL LETTER SALTILLO
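
The idea of folding at both index time and query time can be sketched as follows. This is only an illustration in Python, not MediaWiki or CirrusSearch code: the names APOSTROPHE_LIKE, fold_apostrophes, build_index and search are hypothetical, and the "index" is a plain dictionary of titles standing in for a real search index.

```python
# Hypothetical sketch: fold apostrophe-like characters to U+0027
# both when indexing and when searching.

# Characters taken from the two lists above (U+0027 itself needs no mapping).
APOSTROPHE_LIKE = (
    "\u2018\u2019\u201b\u2032\u00b4\u0060\uff40\uff07"              # 2009 list
    "\u02b9\u02bb\u02bc\u02bd\u02be\u02bf\u0384\u1fbf\ua78b\ua78c"  # proposed additions
)
FOLD_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def fold_apostrophes(text: str) -> str:
    """Replace every apostrophe-like character with U+0027."""
    return text.translate(FOLD_TABLE)


def build_index(titles):
    """Toy index: folded title -> list of original titles."""
    index = {}
    for title in titles:
        index.setdefault(fold_apostrophes(title), []).append(title)
    return index


def search(index, query):
    """Fold the query the same way the titles were folded, then look it up."""
    return index.get(fold_apostrophes(query), [])


index = build_index(["O'", "O\u2019", "O\u02bb"])
print(search(index, "O'"))
```

Because the same folding is applied on both sides, a query typed with U+0027, U+2019 or U+02BB finds all three titles in this toy index.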

WebKit-based browsers already do this kind of stripping and merge U+0027, U+2018, U+2019 and U+FF07. However, there are many cases where merging all the proposed characters would help regular keyboard input.

The proposed solution in 2009 was to use a strip function:
function stripForSearch( $string ) {
	// Fold U+2019 (UTF-8 bytes E2 80 99) to U+0027, then let the parent class strip the string.
	$s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
	return parent::stripForSearch( $s );
}
Comment 2 Denis Jacquerye 2012-08-20 14:18:36 UTC
oops the second search is meant to be:
https://fr.wikipedia.org/w/index.php?title=Spécial%3ARecherche&profile=default&search=%22prince+d’Ithaque%22&fulltext=Search&searchengineselect=mediawiki

Another example is searching for "O'" on fr.w:
https://fr.wikipedia.org/w/index.php?search=o%27&title=Spécial%3ARecherche&fulltext=1
The article "O'" (which is a redirect to "O") https://fr.wikipedia.org/w/index.php?title=O%27&redirect=no is found as an exact match but "O’" https://fr.wikipedia.org/wiki/O’ and "Oʻ"  https://fr.wikipedia.org/wiki/Oʻ are not on the first page of the search results.
Comment 3 Andre Klapper 2013-03-26 11:20:03 UTC
[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]
Comment 4 Chad H. 2013-10-29 15:43:40 UTC
*** Bug 47881 has been marked as a duplicate of this bug. ***
Comment 5 Chad H. 2013-10-29 15:45:26 UTC
Widening scope a tiny bit. If we're going to do this it should be done all at once.

AntiSpoof's sort of the idea I'm thinking here.

Repurposing into a Cirrus bug as lsearchd has been end-of-lifed and won't be fixed further.
Comment 6 Nik Everett 2013-11-05 16:22:19 UTC
Chad,

Were you thinking this should be done in Cirrus for all languages by pushing analysis configuration to Elasticsearch?  Something along those lines would be pretty flexible, allowing, for example, us to boost perfect matches of the typed unicode characters above the squashed ones.  I'm not saying that is a good idea, just something that is possible.
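
One way such an analysis configuration might look is sketched below in Python, which only builds the JSON bodies and does not talk to a server. This is an assumption-laden illustration, not the CirrusSearch configuration: it assumes a recent Elasticsearch with the "mapping" character filter and multi-fields, and the names apostrophe_fold, folded, as_typed and the field "title" are made up for the example. Only a subset of the characters from the description is mapped.

```python
import json

# Illustrative subset of the apostrophe-like characters from the description.
APOSTROPHE_LIKE = ["\u2018", "\u2019", "\u201b", "\u02bb", "\u02bc", "\uff07"]


def build_index_body():
    """Index settings: one analyzer that squashes apostrophes, one that does not."""
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "apostrophe_fold": {
                        "type": "mapping",
                        "mappings": [f"{ch} => '" for ch in APOSTROPHE_LIKE],
                    }
                },
                "analyzer": {
                    "folded": {
                        "type": "custom",
                        "char_filter": ["apostrophe_fold"],
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    },
                    "as_typed": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "folded",
                    # Sub-field that keeps the characters as they were typed.
                    "fields": {"as_typed": {"type": "text", "analyzer": "as_typed"}},
                }
            }
        },
    }


def build_query(text, exact_boost=2.0):
    """Match on the squashed field, and boost hits on the as-typed sub-field."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": {"query": text}}},
                    {"match": {"title.as_typed": {"query": text, "boost": exact_boost}}},
                ]
            }
        }
    }


print(json.dumps(build_query("prince d\u2019Ithaque"), ensure_ascii=False, indent=2))
```

With this shape, a document matches whichever apostrophe was typed, and documents whose text uses the same character as the query collect an extra score from the second clause.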
Comment 7 Chad H. 2013-11-05 16:49:47 UTC
(In reply to comment #6)
> Chad,
> 
> Were you thinking this should be done in Cirrus for all languages by pushing
> analysis configuration to Elasticsearch?  Something along those lines would
> be pretty flexible, allowing, for example, us to boost perfect matches of the
> typed unicode characters above the squashed ones.

Yeah that was pretty much my thinking.

> I'm not saying that is a
> good idea, just something that is possible.

I think it's a good idea, eventually. I set priority so low on purpose :)
Comment 8 Nik Everett 2013-12-27 15:14:22 UTC
Added see also bug.  I think we should do this when we pull the unicode plugin in to Elasticsearch.
Comment 9 MZMcBride 2014-01-05 01:59:57 UTC
Looks like apostrophes came up on The Daily WTF: <http://thedailywtf.com/Articles/Lightspeed-is-Too-Slow-for-MY-Luggage.aspx> (specifically <http://img.thedailywtf.com/images/14/q1/e95/Pic-5.jpg>).

(In reply to comment #6)
> Were you thinking this should be done in Cirrus for all languages by pushing
> analysis configuration to Elasticsearch?  Something along those lines would
> be pretty flexible, allowing, for example, us to boost perfect matches of the
> typed unicode characters above the squashed ones.

We already do some input normalization at some level of the stack (for example, multiple underscores get squashed and input such as "AbrAhAm LincoLn" works if there's a redirect at "Abraham lincoln").

It's difficult to look at the provided screenshot and not think that the software has failed our readers. Unless you think these should be MediaWiki page redirects (#REDIRECT)? I think we should do better normalization for search inputs.

Any rough idea how big of a project this would be to implement?
Comment 10 MZMcBride 2014-01-05 02:03:51 UTC
(In reply to comment #9)
> We already do some input normalization at some level of the stack (for
> example, multiple underscores get squashed and input such as "AbrAhAm LincoLn"
> works if there's a redirect at "Abraham lincoln").

To be more explicit on these points:

https://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=AbrAhAm+LincoLn

https://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=_____AbrAhAm_____LincoLn_____

We may be able to implement apostrophe normalization at the same level.
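
The kind of input normalization shown by the two links above can be sketched like this. It is a rough Python illustration of the behaviour described, not how MediaWiki actually implements it; normalize_query and find_title are hypothetical names and the list of titles stands in for the page table.

```python
import re


def normalize_query(text: str) -> str:
    """Treat underscores as spaces, collapse runs of them, and trim the ends."""
    return re.sub(r"[_\s]+", " ", text).strip()


def find_title(titles, query):
    """Return the first title equal to the normalized query, ignoring case."""
    wanted = normalize_query(query).casefold()
    for title in titles:
        if title.casefold() == wanted:
            return title
    return None


titles = ["Abraham lincoln"]
print(find_title(titles, "AbrAhAm LincoLn"))
print(find_title(titles, "_____AbrAhAm_____LincoLn_____"))
```

An apostrophe-folding step could be slotted into normalize_query in the same way, which is the suggestion made in this comment.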
Comment 11 Nik Everett 2014-01-05 17:57:24 UTC
I'll have a look at this when I can.  For now I'll leave the component set to CirrusSearch.  It looks like PHP implements the same normalization components that I can use in Elasticsearch (http://php.net/manual/en/class.normalizer.php) so I'll have to evaluate doing that normalization there as well.  I imagine that if we do it in PHP it'll have to be optional, because the normalizer requires PHP 5 >= 5.3.0 and PECL intl >= 1.0.0.
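
As a small check of what standard Unicode normalization does and does not cover, here is a Python sketch using the standard library's unicodedata module rather than PHP's Normalizer. It assumes compatibility normalization (NFKC) is the form under consideration, which the comment above does not state. The helper name nfkc is made up for the example.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (form NFKC)."""
    return unicodedata.normalize("NFKC", text)


# Fullwidth forms have compatibility decompositions, so NFKC folds them.
print(nfkc("\uff07") == "'")        # FULLWIDTH APOSTROPHE -> U+0027
print(nfkc("\uff40") == "`")        # FULLWIDTH GRAVE ACCENT -> U+0060

# Typographic and modifier apostrophes have no decomposition and pass through.
print(nfkc("\u2019") == "\u2019")   # RIGHT SINGLE QUOTATION MARK is unchanged
print(nfkc("\u02bc") == "\u02bc")   # MODIFIER LETTER APOSTROPHE is unchanged
```

So normalization on its own would merge only the fullwidth variants with their ASCII counterparts; characters such as U+2019 would still need an explicit mapping on top of it.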
Comment 12 Nik Everett 2014-01-05 18:06:20 UTC
In case anyone comes to this from http://thedailywtf.com/Articles/Lightspeed-is-Too-Slow-for-MY-Luggage.aspx#Pic-5, they should have a look at Bug 59666 which should plug that particular embarrassing hole.
