Last modified: 2014-08-30 05:39:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T33135, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 31135 - Lucene tokenization is wrong for Indic languages


Summary:	Lucene tokenization is wrong for Indic languages

Status:	NEW

Product:	Wikimedia
Classification:	Unclassified
Component:	lucene-search-2 (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal normal (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	i18n, upstream

Depends on:
Blocks:

Reported:	2011-09-24 13:56 UTC by Santhosh Thottingal
Modified:	2014-08-30 05:39 UTC (History)
CC List:	4 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Santhosh Thottingal 2011-09-24 13:56:52 UTC

Lucene splits words at format control characters such as ZWJ and ZWNJ, so words in Indic languages such as Sinhala are broken in unwanted places.

This is the log from lucened when the string ශ්‍රීලංකා (Sri Lanka, written in Sinhala) is searched:

25959 [pool-2-thread-1] INFO  org.wikimedia.lsearch.search.SearchEngine  - search wikidb: query=[ශ්‍රීලංකා] parsed=[custom(+(+(contents:ශ්^0.2 contents:ශ^0.1) +(contents:රීලංකා^0.2 contents:රලක^0.1)) relevance ([((P contents:"(ශ් ශ) (රීලංකා රලක)"~100) (((P sections:"(ශ් ශ)") (P sections:"(රීලංකා රලක)") (P sections:"(ශ් ශ) (රීලංකා රලක)"))^0.25))^2.0], ((P alttitle:"(ශ් ශ)"^2.5) (P alttitle:"(රීලංකා රලක)"^2.5) (P alttitle:"(ශ් ශ) (රීලංකා රලක)"~20^2.5)) ((P related:"(ශ් ශ)"^12.0) (P related:"(රීලංකා රලක)"^12.0) (P related:"(ශ් ශ) (රීලංකා රලක)"^12.0))) (P alttitle:"ශ් රීලංකා"~20))] hit=[0] in 250ms using IndexSearcherMul:1316871160395


ශ්‍රීලංකා is 0DC1 + 0DCA + 200D + 0DBB + 0DD3 + 0DBD + 0D82 + 0D9A + 0DCF
or SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA

The word is a single unit and cannot be tokenized further, but the log shows that it is split at the position of the ZWJ.
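
The split can be reproduced outside Lucene. The following is a minimal Python sketch, not lucene-search or Lucene code: it rebuilds the word from the code points listed above and breaks it wherever a character of Unicode general category Cf (format) occurs. The names describe and split_at_format_chars are hypothetical, and breaking at Cf characters is only an assumed model of the behaviour seen in the log.

```python
import unicodedata

# Code points exactly as listed in this report.
CODE_POINTS = [0x0DC1, 0x0DCA, 0x200D, 0x0DBB, 0x0DD3,
               0x0DBD, 0x0D82, 0x0D9A, 0x0DCF]
word = "".join(chr(cp) for cp in CODE_POINTS)


def describe(text):
    """Return (code point, general category, Unicode name) per character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
            for ch in text]


def split_at_format_chars(text):
    """Assumed model of the faulty behaviour: treat every Cf character
    (ZWJ, ZWNJ, ...) as a token boundary and drop it."""
    pieces, current = [], []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces


if __name__ == "__main__":
    for row in describe(word):
        print(row)
    print(split_at_format_chars(word))
```

Under this model the word falls into two pieces, SHA + VIRAMA and RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA, which correspond to the two contents: terms in the parsed query above.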

The solution would be to write language-specific tokenization rules in Lucene.
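
As a rough illustration of what such a rule could look like, here is a Python sketch under stated assumptions. It is not a Lucene tokenizer and does not use any Lucene API. The names JOIN_CONTROLS, is_word_char and tokenize are hypothetical, and the rule itself (letters, combining marks, digits, ZWJ and ZWNJ all count as word characters) is an assumption chosen for the example, not a complete treatment of any Indic script.

```python
import unicodedata

# ZWNJ (U+200C) and ZWJ (U+200D) are kept inside tokens instead of
# being treated as boundaries.
JOIN_CONTROLS = {"\u200c", "\u200d"}


def is_word_char(ch):
    """Letters (L*), combining marks (M*), digits (N*) and join controls."""
    return ch in JOIN_CONTROLS or unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text):
    """Split text into runs of word characters, keeping ZWJ/ZWNJ in place."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    # Join controls left dangling at a token edge carry no meaning; trim them.
    trimmed = (t.strip("\u200c\u200d") for t in tokens)
    return [t for t in trimmed if t]


if __name__ == "__main__":
    sri_lanka = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0dbd\u0d82\u0d9a\u0dcf"
    print(tokenize(sri_lanka + " wiki"))
```

With this rule the Sinhala word from the report stays a single token, while whitespace and punctuation still separate words.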

Comment 1 Mark A. Hershberger 2011-09-24 17:57:59 UTC

See also: https://issues.apache.org/jira/browse/LUCENE-2747

Comment 2 Diederik van Liere 2011-11-28 21:33:21 UTC

Actually, Lucene from 3.1 onwards has an Indic tokenizer: http://lucene.apache.org/java/3_4_0/api/all/org/apache/lucene/analysis/in/IndicTokenizer.html

Comment 3 Andre Klapper 2013-03-26 11:20:16 UTC

[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]

Comment 4 Nemo 2014-08-30 05:39:16 UTC

Santhosh, have you tested the results with CirrusSearch ([[mw:Search]])?
