Last modified: 2014-08-30 05:39:16 UTC
Lucene treats format control characters such as ZWJ (U+200D) and ZWNJ (U+200C) as token boundaries, so words in Indic languages and Sinhala are broken in unwanted places.

This is the log from lucened when the string ශ්රීලංකා (Sri Lanka, written in Sinhala) is searched:

25959 [pool-2-thread-1] INFO org.wikimedia.lsearch.search.SearchEngine - search wikidb: query=[ශ්රීලංකා] parsed=[custom(+(+(contents:ශ්^0.2 contents:ශ^0.1) +(contents:රීලංකා^0.2 contents:රලක^0.1)) relevance ([((P contents:"(ශ් ශ) (රීලංකා රලක)"~100) (((P sections:"(ශ් ශ)") (P sections:"(රීලංකා රලක)") (P sections:"(ශ් ශ) (රීලංකා රලක)"))^0.25))^2.0], ((P alttitle:"(ශ් ශ)"^2.5) (P alttitle:"(රීලංකා රලක)"^2.5) (P alttitle:"(ශ් ශ) (රීලංකා රලක)"~20^2.5)) ((P related:"(ශ් ශ)"^12.0) (P related:"(රීලංකා රලක)"^12.0) (P related:"(ශ් ශ) (රීලංකා රලක)"^12.0))) (P alttitle:"ශ් රීලංකා"~20))] hit=[0] in 250ms using IndexSearcherMul:1316871160395

ශ්රීලංකා is
0DC1 + 0DCA + 200D + 0DBB + 0DD3 + 0DBD + 0D82 + 0D9A + 0DCF
or
SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA

This is a single word and must not be tokenized further, but the parsed query shows that it is split at the ZWJ into ශ් and රීලංකා, and the search returns 0 hits. The solution would be to write language-specific tokenization rules in Lucene.
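The split at the ZWJ can be reproduced outside Lucene with a minimal Python sketch. This is not Lucene's tokenizer: `tokenize`, `is_word_char` and the `keep_joiners` flag are hypothetical names invented for illustration, and the rule "a token is a run of letters, marks and digits" is only an assumed approximation of what a generic tokenizer does. It shows that a generic rule breaks the word at U+200D, while a rule that treats ZWJ/ZWNJ as part of the word keeps it whole.

```python
import unicodedata

# The word from this report, spelled out by code point:
# SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA
WORD = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3\u0DBD\u0D82\u0D9A\u0DCF"

# ZWNJ and ZWJ: general category Cf (format), not letters or marks.
JOINERS = {"\u200C", "\u200D"}


def is_word_char(ch, keep_joiners):
    """Assumed rule: letters (L*), marks (M*) and numbers (N*) form words."""
    if keep_joiners and ch in JOINERS:
        return True
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text, keep_joiners=False):
    """Split text into maximal runs of word characters (hypothetical helper)."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch, keep_joiners):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    # Show each code point with its Unicode name and general category.
    for ch in WORD:
        print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
    print(tokenize(WORD))                     # generic rule: two tokens
    print(tokenize(WORD, keep_joiners=True))  # joiner-aware rule: one token
```

With the generic rule the word comes out as the same two pieces seen in the parsed query above; with `keep_joiners=True` it stays a single token. A real fix would also have to apply the same rule at index time so that indexed and queried tokens match.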
See also: https://issues.apache.org/jira/browse/LUCENE-2747
Actually, Lucene from 3.1 onwards has an Indic tokenizer: http://lucene.apache.org/java/3_4_0/api/all/org/apache/lucene/analysis/in/IndicTokenizer.html
[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]
Santhosh, have you tested the results with CirrusSearch ([[mw:Search]])?