Last modified: 2011-09-09 18:32:34 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T31371, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 29371 - jQuery.suggestions.js highlighting of UTF-8 characters like "äüöß" does not work if such a non-ASCII is first character


Summary:	jQuery.suggestions.js highlighting of UTF-8 characters like "äüöß" does not w...

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	JavaScript (Other open bugs)
Version:	1.20.x
Hardware:	All All

Importance:	Normal normal (vote)
Target Milestone:	---
Assigned To:	T. Gries

URL:
Whiteboard:
Keywords:

Depends on:	29368
Blocks:

Reported:	2011-06-13 12:02 UTC by T. Gries
Modified:	2011-09-09 18:32 UTC (History)
CC List:	3 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Screenshot showing highlight vs non-highlight (9.04 KB, image/png) 2011-06-13 17:16 UTC, Brion Vibber	Details

Description T. Gries 2011-06-13 12:02:18 UTC

If you have a page called "Österreich", the "Ö" (or, in general, "Öst" and any longer prefix) is not highlighted in the search suggestions interface.

1. When the non-ASCII character is not the first character, highlighting works.
2. Only results starting with a plain ASCII character are highlighted correctly.
3. Non-ASCII characters are highlighted only when they sit between other highlighted characters.

Page name - observation

Österreich - no highlight, when entering ö - s - t...
Niederösterreich - highlights ok

See also Extension:Vector, which extends/uses jquery.suggestions.js.

Comment 1 Brion Vibber 2011-06-13 17:16:57 UTC

Created attachment 8645 [details]
Screenshot showing highlight vs non-highlight

Screenshot showing the problem -- the matching chars should be highlighted in the entries in the drop-down, but where we start with a non-ascii char it doesn't match up.

Comment 2 Brion Vibber 2011-06-13 17:25:25 UTC

Looks like the actual highlighting is passed through from jquery.suggestions to jquery.autoEllipsis through to jquery.highlightText where a regex is used:

	// TODO - need to be smarter about the character matching here. 
	// non latin characters can make regex think a new word has begun. 
	// look for an occurrence of our pattern and store the starting position
	var pos = node.data.search( new RegExp( "\\b" + $.escapeRE( pat ), "i" ) );

Looks like the \b (word break) gets confused at the 'Ö' despite being a legit word character. WTF? :(

Comment 3 Bawolff (Brian Wolff) 2011-06-13 19:44:59 UTC

My understanding is that in JavaScript regexes, \b only considers [a-zA-Z0-9_] (aka what \w matches) to be word characters, so Ö is technically not a word character, thus it really is a word boundary. See http://bclary.com/2004/11/07/#a-15.10.2.6
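
The behaviour described in this comment can be reproduced with a few lines of plain JavaScript. This is only an illustrative sketch: the helper name `wordBoundary` is made up for the example, and the pattern is built the same way as in the quoted jquery.highlightText snippet, minus the escaping step. Note that it also accounts for the "Niederösterreich" observation from the description: "r" is a word character and "ö" is not, so a \b does sit between them.

```javascript
// Hypothetical helper: builds the same kind of pattern as the quoted snippet
// ("\b" + pattern, case-insensitive). Escaping is omitted because the sample
// patterns contain no regex metacharacters.
const wordBoundary = (pat) => new RegExp("\\b" + pat, "i");

// "Ö" is not in [A-Za-z0-9_], so there is no \b between start-of-string and "Ö".
const posLeading = "Österreich".search(wordBoundary("öst"));

// "r" is a word character and "ö" is not, so \b matches between them.
const posInner = "Niederösterreich".search(wordBoundary("öst"));

// Titles starting with an ASCII letter behave as expected.
const posAscii = "Oslo".search(wordBoundary("os"));

console.log(posLeading, posInner, posAscii); // -1 6 0
```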

Comment 4 T. Gries 2011-06-13 21:51:09 UTC

(In reply to comment #3)
> My understanding is that in JavaScript regexes, \b only considers [a-zA-Z0-9_]
> (aka what \w matches) to be word characters, so Ö is technically not a word
> character, thus it really is a word boundary. See
> http://bclary.com/2004/11/07/#a-15.10.2.6

PHP, but not JavaScript, has multi-byte-aware mb_ functions. I am pretty sure you know this, don't you?

Comment 5 Brion Vibber 2011-06-13 21:54:07 UTC

JavaScript strings are UTF-16, and hence inherently aware of Unicode.

In general though the substring matches here are kinda funky as well, as the suggestion engine might or might not actually be doing simple substring matches.
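
A small sketch of the UTF-16 point, using only standard string and regex features (the variable names are illustrative): the string itself handles "Ö" as an ordinary single code unit, and it is only the ASCII-only \w class that treats it as a non-word character.

```javascript
const title = "Österreich";

// "Ö" (U+00D6) is a single UTF-16 code unit, so indexing and length are unaffected.
const titleLength = title.length;
const firstCodeUnit = title.charCodeAt(0).toString(16);

// Without Unicode-aware classes, \w only covers [A-Za-z0-9_].
const startsWithWordChar = /^\w/.test(title);
const asciiStartsWithWordChar = /^\w/.test("Oslo");

console.log(titleLength, firstCodeUnit, startsWithWordChar, asciiStartsWithWordChar);
// 10 d6 false true
```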

Comment 6 T. Gries 2011-06-13 21:57:17 UTC

(In reply to comment #3)
> My understanding is that in JavaScript regexes, \b only considers [a-zA-Z0-9_]
> (aka what \w matches) to be word characters, so Ö is technically not a word
> character, thus it really is a word boundary. See
> http://bclary.com/2004/11/07/#a-15.10.2.6

I also think that it has to do with the definition of \w.

Comment 7 T. Gries 2011-06-14 21:43:06 UTC

fixed in r90092

Comment 8 T. Gries 2011-06-14 21:58:11 UTC

see 
http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular-expressions
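
One common way around the ASCII-only \b is to anchor the pattern on start-of-string or whitespace instead. The sketch below illustrates that idea under stated assumptions; it is not claimed to be the exact change made in r90092. `escapeRegExp` is a hypothetical stand-in for `$.escapeRE`, and `findHighlightStart` is a made-up name for this example.

```javascript
// Hypothetical stand-in for MediaWiki's $.escapeRE: escapes regex metacharacters.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the index at which highlighting should start, or -1 if the pattern
// does not begin at the start of the text or directly after whitespace.
function findHighlightStart(text, pat) {
  const match = text.match(new RegExp("(^|\\s)" + escapeRegExp(pat), "i"));
  // Skip any whitespace captured by the group so the result points at the pattern itself.
  return match ? match.index + match[1].length : -1;
}

console.log(findHighlightStart("Österreich", "öst"));          // 0
console.log(findHighlightStart("Republik Österreich", "ös")); // 9
console.log(findHighlightStart("Niederösterreich", "öst"));    // -1
```

With this approach a mid-word match such as "Niederösterreich" is no longer reported, which differs from the \b version, where it matched only because "ö" was treated as a non-word character.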

Comment 9 Roan Kattouw 2011-09-09 11:44:06 UTC

Fix deployed. Typing "Öster" into the enwiki search box now works as expected.

Comment 10 T. Gries 2011-09-09 12:06:22 UTC

yes, in de.wikipedia.
but not yet in en.wikipedia (why?)

dewiki 1.17wmf1 (Version 96617)
enwiki 1.17wmf1 (r96617)

Comment 11 Roan Kattouw 2011-09-09 12:08:32 UTC

WFM per comment 9. Have you tried clearing your browser cache?

Comment 12 T. Gries 2011-09-09 12:19:20 UTC

iuiuiuuuiuiu, highlighting works (=> closing this bug now)


but search suggestions (for "Ös" in enwiki) are strange:

Ös

==>

Oslo
Osaka

...
...
Östereich (correctly highlighted)
...


I guess this has to do with TitleKey and transliteration? Where can I find some documentation about this?

Comment 13 Roan Kattouw 2011-09-09 18:32:34 UTC

(In reply to comment #12)
> I guess this has to do with TitleKey and transliteration? Where can I find
> some documentation about this?
I would guess this is TitleKey's doing, yeah. Not sure where you'd find docs other than in TitleKey.
