Last modified: 2011-11-30 00:41:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T34712, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 32712 - External links surrounded by unicode quotation marks break search index
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Search
Version: 1.18.x
Hardware: All All
Importance: Unprioritized major
Target Milestone: ---
Assigned To: Brion Vibber
Reported: 2011-11-29 23:02 UTC by André Köthur
Modified: 2011-11-30 00:41 UTC (History)

Attachments
Testscript to demonstrate preg_replace misbehavior (357 bytes, text/php)
2011-11-29 23:09 UTC, André Köthur

Description André Köthur 2011-11-29 23:02:28 UTC
When a page contains an external link that is surrounded by Unicode quotation marks (U+201E double low-9 quotation mark and U+201C left double quotation mark), the article's entry in the searchindex table (field si_text) ends up as an empty string.

To reproduce, add the following text to an article, then save it or update the full-text index:

„http://example.com“


I've done some investigation.
The first problem arises in includes/search/SearchUpdate.php, starting at line 64, where external URLs are supposed to be stripped. preg_replace destroys the trailing quotation mark and leaves an invalid UTF-8 byte sequence in $text. At some later stage in processing, $text gets truncated to an empty string, presumably because of that invalid sequence.
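The breakage described above can be reproduced outside MediaWiki with a small sketch. This is not the code from SearchUpdate.php and not the attached test script: it is a Python analogue that uses a byte-string regex to mimic PHP's byte-oriented preg_replace. The pattern and the name URL_BYTES_NARROW are hypothetical; the sketch assumes a URL character class that accepts the bytes 0xA0-0xFF but not 0x80-0x9F.

```python
import re

# Hypothetical, simplified URL character class (an assumption, not the
# MediaWiki source): ASCII URL characters plus the bytes 0xA0-0xFF.
# The bytes 0x80-0x9F are deliberately NOT included.
URL_BYTES_NARROW = rb"A-Za-z0-9_/:.,~%\-+&;#?!=()@\xA0-\xFF"


def strip_urls(text: bytes, url_bytes: bytes = URL_BYTES_NARROW) -> bytes:
    """Replace each URL with a space, working on raw bytes."""
    pattern = re.compile(
        rb"(^|[^\[])(?:http|https|ftp):[" + url_bytes + rb"]+([^" + url_bytes + rb"]|$)"
    )
    # Keep the byte before the URL (group 1) and the byte after it (group 2).
    return pattern.sub(rb"\1 \2", text)


# U+201E, the URL, then U+201C, encoded as UTF-8.
sample = "\u201ehttp://example.com\u201c".encode("utf-8")
result = strip_urls(sample)

# The closing quote is E2 80 9C. E2 is swallowed as part of the URL,
# 80 is kept as the "character after the URL", and 9C is left behind.
print(result)  # b'\xe2\x80\x9e \x80\x9c'

try:
    result.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err.reason)
```

Under these assumptions the opening quote survives, but the closing quote loses its lead byte, so the remaining text is no longer valid UTF-8.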
Comment 1 André Köthur 2011-11-29 23:09:16 UTC
Created attachment 9574
Testscript to demonstrate preg_replace misbehavior
Comment 2 Brion Vibber 2011-11-29 23:10:41 UTC
Yeah, that regex'll be breaking off partway through the e2 80 9c sequence for the closing quote.

Need to either change it to proper unicode support, or let it take anything \x80-\xff. This regex dates back to at least 2003, when we still didn't have UTF-8 on everything. :P
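The two options mentioned here can be sketched as follows. This is a Python analogue with hypothetical patterns and function names, not the change that was committed: strip_urls_bytes stays byte-oriented and treats every byte in 0x80-0xFF as a URL character, while strip_urls_unicode matches on decoded code points instead of bytes (roughly what a Unicode-aware regex would do in PHP).

```python
import re

# Hypothetical ASCII URL character class shared by both variants.
ASCII_URL = r"A-Za-z0-9_/:.,~%\-+&;#?!=()@"


def strip_urls_bytes(text: bytes) -> bytes:
    """Option 1: byte-oriented, but accept every non-ASCII byte (0x80-0xFF)."""
    cls = ASCII_URL.encode("ascii") + rb"\x80-\xFF"
    pattern = re.compile(
        rb"(^|[^\[])(?:http|https|ftp):[" + cls + rb"]+([^" + cls + rb"]|$)"
    )
    return pattern.sub(rb"\1 \2", text)


def strip_urls_unicode(text: str) -> str:
    """Option 2: match on code points, so multi-byte characters stay whole."""
    pattern = re.compile(
        r"(^|[^\[])(?:http|https|ftp):[" + ASCII_URL + r"]+([^" + ASCII_URL + r"]|$)"
    )
    return pattern.sub(r"\1 \2", text)


sample = "\u201ehttp://example.com\u201c"

# Option 1 swallows the whole closing quote along with the URL,
# so the output stays valid UTF-8.
bytes_result = strip_urls_bytes(sample.encode("utf-8")).decode("utf-8")

# Option 2 stops at the closing quote and keeps it intact.
unicode_result = strip_urls_unicode(sample)

print(repr(bytes_result))
print(repr(unicode_result))
```

In this sketch both variants produce valid output. The byte-oriented one drops the closing quotation mark together with the URL, which is harmless for a search index, and the code-point variant preserves it.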
Comment 3 Brion Vibber 2011-11-29 23:11:41 UTC
Oh handy -- should be possible to turn that test script into a PHPUnit test case! See https://www.mediawiki.org/wiki/Manual:PHP_unit_testing for some background.
Comment 4 Brion Vibber 2011-11-30 00:41:25 UTC
Fixed in r104635 / r104636 on trunk, including a unit test. Thanks for the example!

Merged to REL1_18 branch for 1.18.1 in r104637.
