Last modified: 2014-10-27 19:36:23 UTC
Steps to reproduce:

1. visit http://hu.wikipedia.org
2. type "kurtvirag" in the search box

Expected: [[hu:Kürtvirág]] is suggested

Actual: no suggestions

This is particularly problematic because people often don't have access to the right type of keyboard, and have only very inconvenient ways of entering characters with diacritics. On mobile, entering diacritics is inconvenient even when the keyboard is set up correctly.

The old behavior was to drop all diacritics for indexing, which was not great, but better than the current one. The ideal behavior would be to index both the exact and the stripped title, and give more weight to the first; so search suggestions with different diacritics would not crowd out better matches but would still appear if there is no perfect match.
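To make the "ideal behavior" above concrete, here is a minimal Python sketch, not the actual search backend code. The names `strip_diacritics` and `suggest` are hypothetical, and the two-tier ranking (exact prefix matches first, diacritic-stripped matches second) is only one assumed way of giving "more weight to the first".

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks, e.g. 'ü' -> 'u'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest(query: str, titles: list[str], limit: int = 10) -> list[str]:
    """Hypothetical prefix suggester: exact matches outrank stripped ones."""
    q_exact = query.casefold()
    q_stripped = strip_diacritics(q_exact)
    scored = []
    for title in titles:
        exact = title.casefold()
        if exact.startswith(q_exact):
            scored.append((0, title))  # tier 0: matches as typed
        elif strip_diacritics(exact).startswith(q_stripped):
            scored.append((1, title))  # tier 1: matches once diacritics are dropped
    # sort() is stable, so titles keep their input order within a tier.
    scored.sort(key=lambda pair: pair[0])
    return [title for _, title in scored[:limit]]
```

With this sketch, `suggest("kurtvirag", ["Kürtvirág"])` still finds the article, while for a query such as "loic" an exactly matching title is listed ahead of "Loïc" instead of being crowded out by it.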
Can the new search be configured per site? There is a discussion about this problem on fiwiki as well, and one of us noticed that the search behaves differently on dewiki:

1. Go to http://de.wikipedia.org/
2. Type "aanekos" in the search box.

Result: The search suggests "Äänekoski".
(In reply to Tisza Gergő from comment #0)
> Steps to reproduce:
>
> 1. visit http://hu.wikipedia.org
> 2. type "kurtvirag" in the search box
>
> Expected: [[hu:Kürtvirág]] is suggested
>
> Actual: no suggestions

> The ideal behavior would be to index both the exact and the stripped title,
> and give more weight to the first; so search suggestions with different
> diacritics would not crowd out better matches but would still appear if
> there is no perfect match.

Two solutions:

Better suggestions: Add an ASCII-normalized lookup for suggestions. It looks like German already does this, so I'd just have to figure out how and use it in more places.

Weighted search: Everywhere we search, look both with the diacritics and without; the match with diacritics gets more boost.

Hmmm - so we already perform some weighted search: exact matches are worth more than normalized (non-conjugated, non-declined, etc.) matches. I'm worried adding another layer would be nasty from a performance perspective. The suggestions might be faster. I'm not really sure. I'll have to sleep on it.

(In reply to Mikko Silvonen from comment #1)
> Can the new search be configured per site?

It certainly can. If the language is in this list then it already is: arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai. Both Finnish and Hungarian are in the list, so they are getting whatever the Lucene project thinks are good defaults. I'm happy to customize it from there.

In the meantime, I'm setting this to "Normal" priority. It won't be the top of my list but it's certainly on it. Feel free to poke the priority if lack of this makes search horrible for you.
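For illustration only, the "better suggestions" option (an ASCII-normalized lookup) could look roughly like the Python sketch below. This is an assumption about the general technique, not how CirrusSearch or the German configuration actually does it; `fold` and `FoldedPrefixIndex` are hypothetical names. Titles are stored under a folded key at index time, and the typed prefix is folded the same way at query time.

```python
import bisect
import unicodedata


def fold(text: str) -> str:
    """Case-fold and drop combining marks so 'Äänekoski' becomes 'aanekoski'."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class FoldedPrefixIndex:
    """Hypothetical suggestion index keyed by the folded form of each title."""

    def __init__(self, titles):
        # Sort by folded key so all titles sharing a prefix are contiguous.
        self._entries = sorted((fold(title), title) for title in titles)
        self._keys = [key for key, _ in self._entries]

    def lookup(self, prefix, limit=10):
        key = fold(prefix)
        # Binary search for the first folded key that is >= the folded prefix.
        start = bisect.bisect_left(self._keys, key)
        results = []
        for folded, title in self._entries[start:]:
            if len(results) >= limit or not folded.startswith(key):
                break
            results.append(title)
        return results
```

Because the folding happens once per title when the index is built, a lookup costs one binary search plus a short scan, which is why this route might be cheaper than adding another weighted layer to every full search. Unlike the weighted approach, this sketch does not rank exact matches above folded ones.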
*** Bug 68239 has been marked as a duplicate of this bug. ***
This problem was also reported on Portuguese Wikipedia. See https://pt.wikipedia.org/wiki/WP:Esplanada/geral/Sobre_o_campo_de_busca_%2813out2014%29?oldid=40284754 and https://commons.wikimedia.org/wiki/File:Ptwiki-print-exemplo-busca-acento.png for a few examples.
Change 168071 had a related patch set uploaded by Manybubbles: Prefix search always squashes accents https://gerrit.wikimedia.org/r/168071
Change 168071 merged by jenkins-bot: Prefix search always squashes accents https://gerrit.wikimedia.org/r/168071
Created attachment 16903
Bug 67521, rowiki, testcase 1 (Loïc vs Loic)

At least on rowiki, search does not suggest titles that contain diacritics when the plain letters are typed instead.
Created attachment 16904
Search results and suggestions for ”Pedro Proença” (rowiki)
It is now working fine on rowiki too. Ignore my previous two posts.
Sadly, we've now reverted CirrusSearch due to an outage in the underlying system. We'll re-enable it once we figure out what is up. So it'll break again. Then we'll push this change out and rebuild the index, and it should be fixed again.