Last modified: 2014-08-30 05:39:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T33135, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 31135 - Lucene tokenization is wrong for Indic languages
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: lucene-search-2
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: i18n, upstream
Depends on:
Blocks:
Reported: 2011-09-24 13:56 UTC by Santhosh Thottingal
Modified: 2014-08-30 05:39 UTC (History)

Description Santhosh Thottingal 2011-09-24 13:56:52 UTC
Lucene splits words at format control characters such as ZWJ and ZWNJ, which breaks words in Indic languages, including Sinhala, at unwanted places.

This is the log from lucened when the string ශ්‍රීලංකා (Sri Lanka, written in Sinhala) is searched:

25959 [pool-2-thread-1] INFO  org.wikimedia.lsearch.search.SearchEngine  - search wikidb: query=[ශ්‍රීලංකා] parsed=[custom(+(+(contents:ශ්^0.2 contents:ශ^0.1) +(contents:රීලංකා^0.2 contents:රලක^0.1)) relevance ([((P contents:"(ශ් ශ) (රීලංකා රලක)"~100) (((P sections:"(ශ් ශ)") (P sections:"(රීලංකා රලක)") (P sections:"(ශ් ශ) (රීලංකා රලක)"))^0.25))^2.0], ((P alttitle:"(ශ් ශ)"^2.5) (P alttitle:"(රීලංකා රලක)"^2.5) (P alttitle:"(ශ් ශ) (රීලංකා රලක)"~20^2.5)) ((P related:"(ශ් ශ)"^12.0) (P related:"(රීලංකා රලක)"^12.0) (P related:"(ශ් ශ) (රීලංකා රලක)"^12.0))) (P alttitle:"ශ් රීලංකා"~20))] hit=[0] in 250ms using IndexSearcherMul:1316871160395


ශ්‍රීලංකා is 0DC1 + 0DCA + 200D + 0DBB + 0DD3 + 0DBD + 0D82 + 0D9A + 0DCF,
or SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA.
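
The breakdown above can be checked with a short script. This is only an illustrative sketch in Python, not part of lucene-search-2; the helper name `describe` is made up for this example. It rebuilds the word from the code points listed above (so the invisible ZWJ cannot be lost in copy and paste) and prints each character's Unicode general category and name.

```python
import unicodedata

# Code points exactly as listed in the report.
CODE_POINTS = [0x0DC1, 0x0DCA, 0x200D, 0x0DBB, 0x0DD3, 0x0DBD, 0x0D82, 0x0D9A, 0x0DCF]
word = "".join(chr(cp) for cp in CODE_POINTS)


def describe(text):
    """Return (code point, general category, Unicode name) for each character.

    `describe` is a hypothetical helper for this illustration only.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    for row in describe(word):
        print(*row)
```

The third character, U+200D ZERO WIDTH JOINER, has general category Cf (format), while its neighbours are letters and combining marks. A tokenizer that accepts only letters, marks and digits as word characters will therefore cut the word at that position.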

The word is a single unit and cannot be tokenized further, but the parsed query shows that it is split at the position of the ZWJ.

The solution would be to write language-specific tokenization rules in Lucene.
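
As a rough illustration of what such a rule could look like, here is a minimal Python sketch. It is not Lucene code and not the lucene-search-2 tokenizer; the function `tokenize` and its `keep_joiners` flag are assumptions made up for this example. It treats letters, combining marks and digits as word characters, and optionally treats ZWJ and ZWNJ as word characters too.

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"

# The Sinhala word from this report, built from the listed code points.
word = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0dbd\u0d82\u0d9a\u0dcf"


def tokenize(text, keep_joiners):
    """Split text into runs of letters (L*), marks (M*) and digits (N*).

    Hypothetical helper: when keep_joiners is True, ZWJ and ZWNJ are also
    treated as word characters, so they no longer end a token.
    """
    tokens, current = [], []
    for ch in text:
        is_word = unicodedata.category(ch)[0] in "LMN" or (
            keep_joiners and ch in (ZWJ, ZWNJ)
        )
        if is_word:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(tokenize(word, keep_joiners=False))  # two tokens, split at the ZWJ
    print(tokenize(word, keep_joiners=True))   # one token, the whole word
```

With `keep_joiners=False` the word splits into the same two pieces seen in the parsed query in the log; with `keep_joiners=True` it stays whole. A real fix would have to be made in the Java analyzer used by the search backend, and would also need to decide how to handle joiners at the start or end of a token, which this sketch ignores.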
Comment 1 Mark A. Hershberger 2011-09-24 17:57:59 UTC
See also: https://issues.apache.org/jira/browse/LUCENE-2747
Comment 2 Diederik van Liere 2011-11-28 21:33:21 UTC
Actually, Lucene from 3.1 onwards has an Indic tokenizer: http://lucene.apache.org/java/3_4_0/api/all/org/apache/lucene/analysis/in/IndicTokenizer.html
Comment 3 Andre Klapper 2013-03-26 11:20:16 UTC
[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]
Comment 4 Nemo 2014-08-30 05:39:16 UTC
Santhosh, have you tested the results with CirrusSearch ([[mw:Search]])?
