Last modified: 2011-11-30 00:41:25 UTC
When a page contains an external link surrounded by Unicode quotation marks (U+201E double low-9 quotation mark and U+201C left double quotation mark), the article's entry in the searchindex table (field si_text) will be an empty string.

To reproduce: add the following text to an article and save it or update the fulltext index:

„http://example.com“

I've done some investigation. The first problem arises in includes/search/SearchUpdate.php, starting at line 64, where external URLs are supposed to be stripped. preg_replace destroys the trailing quotation mark and leaves an illegal Unicode sequence in $text. At some later stage of processing, $text gets truncated to an empty string, presumably because of the illegal Unicode sequence.
Created attachment 9574: Testscript to demonstrate preg_replace misbehavior
Yeah, that regex'll be breaking off partway through the e2 80 9c sequence for the closing quote. Need to either change it to proper unicode support, or let it take anything \x80-\xff. This regex dates back to at least 2003, when we still didn't have UTF-8 on everything. :P
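For anyone who wants to see the byte-level effect without a MediaWiki checkout, here is a rough sketch in Python rather than PHP. The pattern is hypothetical: it only assumes a byte-oriented character class that accepts \xA0-\xFF but not \x80-\x9F, which is what the comment above implies. It is not the actual regex from SearchUpdate.php, and the function names are made up for illustration. The second function shows the "take any non-ASCII character" variant operating on code points instead of bytes.

```python
import re

# Hypothetical URL character class (an assumption, not the MediaWiki pattern):
# it accepts high bytes 0xA0-0xFF but rejects 0x80-0x9F.
URL_BYTES = rb"A-Za-z0-9_/:.,~%\-+&;#?!=()@\xA0-\xFF"
BYTE_PATTERN = re.compile(
    rb"(^|[^\[])(http|https):[" + URL_BYTES + rb"]+([^" + URL_BYTES + rb"]|$)"
)

# Same idea on code points: every character from U+00A0 upwards is accepted,
# so a multi-byte character can never be split.
URL_CHARS = r"A-Za-z0-9_/:.,~%\-+&;#?!=()@\u00A0-\U0010FFFF"
CHAR_PATTERN = re.compile(
    r"(^|[^\[])(http|https):[" + URL_CHARS + r"]+([^" + URL_CHARS + r"]|$)"
)


def strip_urls_bytewise(text: str) -> bytes:
    """Strip URLs while treating the UTF-8 text as a plain byte string."""
    return BYTE_PATTERN.sub(rb"\1 \3", text.encode("utf-8"))


def strip_urls_unicode(text: str) -> str:
    """Strip URLs while matching whole code points."""
    return CHAR_PATTERN.sub(r"\1 \3", text)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "\u201ehttp://example.com\u201c"  # „http://example.com“

broken = strip_urls_bytewise(sample)
print(broken)                 # the lead byte e2 of the closing quote is gone
print(is_valid_utf8(broken))  # False: 80 9c is left without its lead byte

print(repr(strip_urls_unicode(sample)))  # valid text, closing quote eaten with the URL
```

In the byte-wise run the URL match swallows the e2 lead byte (it falls inside \xA0-\xFF) and stops at 80, so the remaining 80 9c is no longer a valid sequence. The code-point variant keeps the text valid, at the cost of treating the closing quote as part of the URL.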
Oh handy -- should be possible to turn that test script into a PHPUnit test case! See https://www.mediawiki.org/wiki/Manual:PHP_unit_testing for some background.
Fixed in r104635 / r104636 on trunk, including a unit test. Thanks for the example! Merged to REL1_18 branch for 1.18.1 in r104637.