Last modified: 2014-10-27 19:36:23 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T69521, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 67521 - Suggest results which differ in diacritics (missing ascii normalized lookup)
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Normal normal with 5 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Duplicates: 68239
Depends on:
Blocks:

Reported: 2014-07-04 05:54 UTC by Tisza Gergő
Modified: 2014-10-27 19:36 UTC (History)
CC List: 8 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Bug 67521, rowiki, testcase 1 (Loïc vs Loic) (107.44 KB, image/png)
2014-10-27 00:49 UTC, Dan
Search results and suggestions for ”Pedro Proença” (rowiki) (289.63 KB, image/png)
2014-10-27 00:52 UTC, Dan

Description Tisza Gergő 2014-07-04 05:54:49 UTC
Steps to reproduce:

1. visit http://hu.wikipedia.org
2. type "kurtvirag" in the search box

Expected: [[hu:Kürtvirág]] is suggested

Actual: no suggestions

This is particularly problematic because people often don't have access to the right type of keyboard, and have only very inconvenient ways of entering characters with diacritics. On mobile, entering diacritics is inconvenient even when the keyboard is set up correctly.

The old behavior was to drop all diacritics for indexing, which was not great, but better than the current one.

The ideal behavior would be to index both the exact and the stripped title, and give more weight to the first; so search suggestions with different diacritics would not crowd out better matches but would still appear if there is no perfect match.
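
The behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea (match on both the exact and the diacritic-stripped title, and rank exact matches first), not how CirrusSearch implements it. The function names `strip_diacritics` and `suggest` and the weight values are assumptions made up for this example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD).

    This covers letters such as "ü", "á" or "ő". Letters that do not
    decompose (for example "ø" or "ł") are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest(prefix, titles, exact_weight=2.0, stripped_weight=1.0):
    """Return titles matching the prefix, exact matches first.

    The weights are hypothetical; they only need to rank an exact
    match above a match that differs in diacritics.
    """
    typed = prefix.casefold()
    folded = strip_diacritics(typed)
    scored = []
    for title in titles:
        lowered = title.casefold()
        if lowered.startswith(typed):
            scored.append((exact_weight, title))
        elif strip_diacritics(lowered).startswith(folded):
            scored.append((stripped_weight, title))
    # sort() is stable, so titles with equal weight keep their input order.
    scored.sort(key=lambda pair: -pair[0])
    return [title for _, title in scored]


print(suggest("kurtvirag", ["Kürtvirág", "Kurta"]))
```

With this ranking, a title that differs only in diacritics still appears when nothing matches exactly, but it never crowds out an exact match.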
Comment 1 Mikko Silvonen 2014-07-06 17:30:11 UTC
Can the new search be configured per site? There is a discussion about this problem on fiwiki as well, and one of us noticed that the search behaves differently on dewiki:

1. Go to http://de.wikipedia.org/
2. Type "aanekos" in the search box.

Result: The search suggests "Äänekoski".
Comment 2 Nik Everett 2014-07-07 13:12:03 UTC
(In reply to Tisza Gergő from comment #0)
> Steps to reproduce:
> 
> 1. visit http://hu.wikipedia.org
> 2. type "kurtvirag" in the search box
> 
> Expected: [[hu:Kürtvirág]] is suggested
> 
> Actual: no suggestions

> The ideal behavior would be to index both the exact and the stripped title,
> and give more weight to the first; so search suggestions with different
> diacritics would not crowd out better matches but would still appear if
> there is no perfect match.

Two solutions:
Better suggestions: Add an ascii normalized lookup for suggestions.  It looks like German already does this so I'd just have to figure out how and use it in more places.

Weighted search: Everywhere where we search look with the diacritics and without - with gets more boost.

Hmmm - so we already perform some weighted search: exact matches are worth more than normalized (non-conjugated, non-declined, etc) matches.  I'm worried adding another layer would be nasty from a performance perspective.  The suggestions might be faster.  I'm not really sure.  I'll have to sleep on it.


(In reply to Mikko Silvonen from comment #1)
> Can the new search be configured per site?

It certainly can.  If the language is in this list then it already is:
arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.


Both Finnish and Hungarian are in the list so they are getting whatever the Lucene project thinks are good defaults.  I'm happy to customize it from there.
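
For illustration only, this is roughly what a per-language customization could look like as Elasticsearch analysis settings, written here as a Python dictionary. It is a sketch under assumptions, not the configuration CirrusSearch uses: the analyzer name `title_folded` and the filter name `folding_keep_original` are hypothetical. The `asciifolding` token filter and its `preserve_original` option are standard Elasticsearch features.

```python
import json

# Hypothetical analysis settings: fold accented characters to ASCII while
# also keeping the original token, so both forms end up in the index.
settings = {
    "analysis": {
        "filter": {
            "folding_keep_original": {  # hypothetical filter name
                "type": "asciifolding",
                "preserve_original": True,
            }
        },
        "analyzer": {
            "title_folded": {  # hypothetical analyzer name
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "folding_keep_original"],
            }
        },
    }
}

# The JSON form is what would be sent when creating an index.
print(json.dumps(settings, indent=2))
```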



In the meantime, I'm setting this to "Normal" priority.  It won't be at the top of my list but it's certainly on it.  Feel free to poke the priority if the lack of this makes search horrible for you.
Comment 3 Andre Klapper 2014-07-19 22:46:48 UTC
*** Bug 68239 has been marked as a duplicate of this bug. ***
Comment 5 Gerrit Notification Bot 2014-10-22 13:50:54 UTC
Change 168071 had a related patch set uploaded by Manybubbles:
Prefix search always squashes accents

https://gerrit.wikimedia.org/r/168071
Comment 6 Gerrit Notification Bot 2014-10-22 16:04:00 UTC
Change 168071 merged by jenkins-bot:
Prefix search always squashes accents

https://gerrit.wikimedia.org/r/168071
Comment 7 Dan 2014-10-27 00:49:06 UTC
Created attachment 16903
Bug 67521, rowiki, testcase 1 (Loïc vs Loic)

At least on rowiki, search does not suggest words that contain diacritics when the corresponding plain letters are typed.
Comment 8 Dan 2014-10-27 00:52:14 UTC
Created attachment 16904
Search results and suggestions for ”Pedro Proença” (rowiki)
Comment 9 Dan 2014-10-27 19:33:05 UTC
It is now working fine on rowiki too. Ignore my previous two posts.
Comment 10 Nik Everett 2014-10-27 19:36:23 UTC
Sadly, we've now reverted CirrusSearch due to an outage in the underlying system.  We'll reenable it once we figure out what is up.  So it'll break again.  And then we'll push this change out and rebuild the index and it should be fixed again.
