Last modified: 2011-09-09 18:32:34 UTC
If you have a page called "Österreich", the "Ö" (or in general "Öst" and any longer prefix) is not highlighted in the search suggestion interface.

1. When the non-ASCII character is not the first character, highlighting works.
2. Only results starting with a plain ASCII character are correctly highlighted.
3. If the non-ASCII characters are not in between other highlighted characters, they are not highlighted.

Page name - observation
Österreich - no highlight when entering ö - s - t...
Niederösterreich - highlights ok

See also Extension:Vector, which extends/uses jquery.suggestions.js.
Created attachment 8645: Screenshot showing highlight vs non-highlight

Screenshot showing the problem: the matching chars should be highlighted in the entries in the drop-down, but where we start with a non-ASCII char it doesn't match up.
Looks like the actual highlighting is passed through from jquery.suggestions to jquery.autoEllipsis through to jquery.highlightText, where a regex is used:

    // TODO - need to be smarter about the character matching here.
    // non latin characters can make regex think a new word has begun.
    // look for an occurence of our pattern and store the starting position
    var pos = node.data.search( new RegExp( "\\b" + $.escapeRE( pat ), "i" ) );

Looks like the \b (word break) gets confused at the 'Ö' despite being a legit word character. WTF? :(
My understanding is that in JavaScript regexes, \b only considers [a-zA-Z0-9_] (aka what \w matches) to be word characters, so Ö is technically not a word character. The start of the string followed by Ö is therefore not a word boundary, while the position between Ö and the following "s" is. See http://bclary.com/2004/11/07/#a-15.10.2.6
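To make the \b behaviour described above concrete, here is a minimal sketch that mirrors the quoted search call outside of jQuery. The names `escapeRE` (a stand-in for `$.escapeRE`) and `findWithWordBoundary` are made up for illustration and are not part of jquery.highlightText.

```javascript
// Stand-in for $.escapeRE (assumption: it escapes regex metacharacters).
function escapeRE(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: same pattern construction as the quoted code.
function findWithWordBoundary(text, pat) {
  return text.search(new RegExp("\\b" + escapeRE(pat), "i"));
}

const OE = "\u00D6"; // Ö
const oe = "\u00F6"; // ö

// Start of string followed by Ö: both sides are "non-word", so \b fails.
const startsWithUmlaut = findWithWordBoundary(OE + "sterreich", oe + "st"); // -1

// "r" is a word character and "ö" is not, so \b matches right before the ö.
const umlautInside = findWithWordBoundary("Nieder" + oe + "sterreich", oe + "st"); // 6

// Plain ASCII titles behave as intended.
const asciiOnly = findWithWordBoundary("Oslo", "os"); // 0
```

This would also explain the observation in the report that "Niederösterreich" highlights fine: the match there relies on ö being treated as a non-word character, not on real word detection.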
(In reply to comment #3)
> My understanding is that in javascript regex's, \b only considers [a-zA-Z0-9_]
> (aka what \w matches) to be word characters, so Ö is technically not part of a
> word character, thus is really is a word boundary. See
> http://bclary.com/2004/11/07/#a-15.10.2.6

PHP, but not JavaScript, has multi-byte aware mb_ functions. I am pretty sure you know this, don't you?
JavaScript strings are UTF-16, and hence inherently aware of Unicode. In general though the substring matches here are kinda funky as well, as the suggestion engine might or might not actually be doing simple substring matches.
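As a rough illustration of that point (a sketch only, assuming the title uses the precomposed Ö, U+00D6, rather than O plus a combining diaeresis): plain string operations already treat Ö as a single unit, so no mb_-style functions are needed for a simple prefix check.

```javascript
const title = "\u00D6sterreich"; // "Österreich" with precomposed Ö
const typed = "\u00F6st";        // "öst" as entered in the search box

// Length is counted in UTF-16 code units, not bytes.
const titleLength = title.length; // 10

// The first code unit is Ö itself.
const firstCodeUnit = title.charCodeAt(0); // 0xD6

// A case-insensitive substring search finds the prefix at index 0.
const prefixIndex = title.toLowerCase().indexOf(typed.toLowerCase()); // 0
```

So the missing highlight comes from how the regex defines word characters, not from how the string is stored.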
(In reply to comment #3)
> My understanding is that in javascript regex's, \b only considers [a-zA-Z0-9_]
> (aka what \w matches) to be word characters, so Ö is technically not part of a
> word character, thus is really is a word boundary. See
> http://bclary.com/2004/11/07/#a-15.10.2.6

I also think that it has to do with the \w definition.
fixed in r90092
see http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular-expressions
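For readers who don't want to follow the link: one common way around the ASCII-only \b is to anchor on start-of-string or whitespace instead. The sketch below shows that idea only. It is not taken from r90092, and `escapeRE` and `findWordStart` are hypothetical names.

```javascript
// Stand-in for $.escapeRE (assumption: it escapes regex metacharacters).
function escapeRE(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns the index where the highlight should start,
// or -1 if pat does not begin a whitespace-delimited word in text.
function findWordStart(text, pat) {
  const match = text.match(new RegExp("(^|\\s)" + escapeRE(pat), "i"));
  if (!match) {
    return -1;
  }
  // Skip the whitespace captured by the first group, if any.
  return match.index + match[1].length;
}

const atStart = findWordStart("\u00D6sterreich", "\u00F6st");          // 0
const secondWord = findWordStart("Republik \u00D6sterreich", "\u00F6st"); // 9
const midWord = findWordStart("Nieder\u00F6sterreich", "\u00F6st");    // -1
```

Note the trade-off: with this approach a match in the middle of a word ("Niederösterreich") is no longer highlighted, and punctuation such as "(" or "-" does not count as a word start unless it is added to the first group.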
Fix deployed. Typing "Öster" into the enwiki search box now works as expected.
Yes, in de.wikipedia, but not yet in en.wikipedia (why?).

dewiki: 1.17wmf1 (Version 96617)
enwiki: 1.17wmf1 (r96617)
WFM per comment 9. Have you tried clearing your browser cache?
Highlighting works (=> closing this bug now), but the search suggestions for "Ös" in enwiki are strange:

Ös ==>
Oslo
Osaka
...
...
Österreich (correctly highlighted)
...

I guess this has to do with TitleKey and transliteration? Where can I find some documentation about this?
(In reply to comment #12)
> I guess, this has to do with TitleKey and Transliteration ? Where can I find
> some documentataion about this ?

I would guess this is TitleKey's doing, yeah. Not sure where you'd find docs other than in TitleKey.