Last modified: 2014-11-19 18:13:38 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T75605, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 73605 - No normalization for ancient greek accents in searches
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Search
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: upstream
Depends on:
Blocks:
Reported: 2014-11-19 11:57 UTC by paolo anghileri
Modified: 2014-11-19 18:13 UTC

Description paolo anghileri 2014-11-19 11:57:39 UTC
I am a PHP developer trying to use MediaWiki for an Ancient Greek dictionary.
One feature this dictionary should have is the ability to search for a word typed without accents or other diacritics and still retrieve all matching words that contain diacritics.

For instance, if I enter the Greek word αλφα (alpha), the search should also return ἄλφα (with diacritics) if it is present in the article database.
This works on the Modern Greek Wiktionary for accented words, but it does not seem to work for Ancient Greek, since it uses different kinds of diacritics.

My question is whether this feature is available.
If it is not, I would like some guidance on the best way to implement it.
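
To make the requested behaviour concrete, here is a minimal sketch of the kind of folding described above, written in Python with only the standard library. It is an illustration under stated assumptions, not what MediaWiki or any search backend actually does: the function name fold_greek is hypothetical, and removing every combining mark (including the iota subscript) and mapping final sigma to sigma are choices made for this example.

```python
import unicodedata

def fold_greek(text: str) -> str:
    """Hypothetical helper: lowercase text and strip all combining marks."""
    # NFD splits a precomposed letter such as ἄ into a base letter
    # followed by combining marks (breathing, accent, iota subscript).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except nonspacing marks (Unicode category Mn).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose, lowercase, and treat final sigma like ordinary sigma.
    return unicodedata.normalize("NFC", stripped).lower().replace("ς", "σ")

# Both spellings reduce to the same search key.
print(fold_greek("ἄλφα") == fold_greek("αλφα"))
```

With such a key, a query typed without diacritics and a stored title with polytonic diacritics compare equal.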

Paolo
Comment 1 Andre Klapper 2014-11-19 12:43:13 UTC
Thanks for taking the time to report this!

I tried the search on https://el.wikipedia.org (which uses the CirrusSearch extension) and αλφα finds άλφα but ἄλφα only seems to find ἄλφα.
Which search backend/extension do you use? Which MediaWiki version is this?
Comment 2 Nik Everett 2014-11-19 13:25:01 UTC
Cirrus uses Elasticsearch for the analysis, which in turn uses Apache Lucene.  I imagine the right place to implement this is there.

It looks like https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/el/GreekLowerCaseFilter.java implements the normalization.  I'd file a bug over there.  It doesn't _look_ like adding the extra normalization would be that hard.  I suppose you'd have to decide with them whether they should be enabled by default (so you could just add them to that file) or optional.  If optional you'd just make a new filter I believe.

After it's released in Lucene and Elasticsearch we could enable it by default for Greek across the site, I think.
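
The "default versus optional" choice mentioned above can be pictured as a chain of token filters in which the extra folding is a separate stage. The following Python sketch is only an analogy: it does not use Lucene's API, and the names lowercase_filter, polytonic_fold_filter, analyze, DEFAULT_CHAIN and WITH_FOLDING are all hypothetical.

```python
import unicodedata
from typing import Callable, Iterable, Iterator, List

TokenFilter = Callable[[Iterable[str]], Iterator[str]]

def lowercase_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Stand-in for an existing lowercasing stage."""
    for token in tokens:
        yield token.lower()

def polytonic_fold_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Optional stage: strip combining marks from each token."""
    for token in tokens:
        decomposed = unicodedata.normalize("NFD", token)
        # combining() is non-zero for accents, breathings, iota subscript.
        yield "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text: str, filters: List[TokenFilter]) -> List[str]:
    """Split on whitespace, then run the tokens through each filter in order."""
    tokens: Iterable[str] = text.split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return list(tokens)

# The optional design leaves the default chain untouched.
DEFAULT_CHAIN: List[TokenFilter] = [lowercase_filter]
WITH_FOLDING: List[TokenFilter] = [lowercase_filter, polytonic_fold_filter]

print(analyze("Ἄλφα", DEFAULT_CHAIN))
print(analyze("Ἄλφα", WITH_FOLDING))
```

Keeping the folding in its own stage means existing users of the default chain see no change, which is the trade-off to settle with the upstream maintainers.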
Comment 3 paolo anghileri 2014-11-19 17:13:54 UTC
(In reply to Andre Klapper from comment #1)
> Thanks for taking the time to report this!
> 
> I tried the search on https://el.wikipedia.org (which uses the CirrusSearch
> extension) and αλφα finds άλφα but ἄλφα only seems to find ἄλφα.
> Which search backend/extension do you use? Which MediaWiki version is this?

Thank you Andre for the reply.
This is the same situation I have found in my searches.

What I need is to be able to search for and retrieve Ancient Greek words both when the vowels' orthographic details are specified (the search string άλφα retrieves άλφα, αλφα and άλφα) and when they are not (the search string αλφα retrieves άλφα, αλφα and άλφα).

The fact that it works for Modern Greek but not for Ancient Greek suggests to me that Ancient Greek is not supported, while Modern Greek, which has different orthographic details, works.
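
Matching in both directions, as described above, usually comes from applying the same normalization when indexing and when querying. The toy index below sketches that idea in Python; it is not how CirrusSearch or Elasticsearch store data, and fold, build_index and search are hypothetical names used only for this example.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List

def fold(word: str) -> str:
    """Lowercase a word and strip its combining marks."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(titles: List[str]) -> Dict[str, List[str]]:
    """Group titles under their folded key (index-time normalization)."""
    index: Dict[str, List[str]] = defaultdict(list)
    for title in titles:
        index[fold(title)].append(title)
    return index

def search(index: Dict[str, List[str]], query: str) -> List[str]:
    """Fold the query the same way before the lookup (query-time normalization)."""
    return index.get(fold(query), [])

index = build_index(["ἄλφα", "άλφα", "αλφα", "βῆτα"])
print(search(index, "αλφα"))   # query without diacritics
print(search(index, "ἄλφα"))   # query with polytonic diacritics
```

Because both sides pass through the same function, either query returns all three spellings of the word and neither returns βῆτα.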
Comment 4 paolo anghileri 2014-11-19 17:22:39 UTC
(In reply to Andre Klapper from comment #1)
> Thanks for taking the time to report this!
> 
> I tried the search on https://el.wikipedia.org (which uses the CirrusSearch
> extension) and αλφα finds άλφα but ἄλφα only seems to find ἄλφα.
> Which search backend/extension do you use? Which MediaWiki version is this?

As for the second part of the question, I am at a preliminary stage of this project and have not installed MediaWiki for it yet, so for the moment I have only tested on public MediaWiki instances, for instance el.wiktionary.org.

I will run local tests in the next few days. Do you have any suggestions about search backends or extensions?

Thanks again

Paolo
Comment 5 paolo anghileri 2014-11-19 17:37:43 UTC
(In reply to Nik Everett from comment #2)

Thank you Nik, I had a look at that file.
I am not an experienced MediaWiki developer, but if the problem is really related to that, maybe I can help with adding the extra normalization.

Thanks

Paolo
Comment 6 Nik Everett 2014-11-19 17:46:28 UTC
(In reply to paolo anghileri from comment #5)


If you want to propose a change to implement it in Lucene then link it here and I'll jump over there and help.  I'm not a Lucene committer but I can certainly review it and prod a committer.

(In reply to paolo anghileri from comment #4)
> I will run local tests in the next few days. Do you have any suggestions
> about search backends or extensions?

Use CirrusSearch.  It's the search backend that we use on all of our wikis.  It's better than the built-in MySQL search in just about every way.  It's also the only option that lets that normalization from Lucene take effect.
Comment 7 paolo anghileri 2014-11-19 18:06:56 UTC
(In reply to Nik Everett from comment #6)

Given that I am not a Wikimedia expert and have not yet explored the CirrusSearch code: as a CirrusSearch developer, do you think this normalization should go through Lucene, or is it possible to implement it directly in the CirrusSearch extension, or maybe in its dependency Elasticsearch?

Otherwise, if this can only be done through Lucene, I'll try adding the extra normalization in Lucene and propose a patch for it.
Comment 8 Nik Everett 2014-11-19 18:09:58 UTC
(In reply to paolo anghileri from comment #7)
> (In reply to Nik Everett from comment #6)
> 
> Given that I am not a Wikimedia expert and have not yet explored the
> CirrusSearch code: as a CirrusSearch developer, do you think this
> normalization should go through Lucene, or is it possible to implement it
> directly in the CirrusSearch extension, or maybe in its dependency
> Elasticsearch?
> 
> Otherwise, if this can only be done through Lucene, I'll try adding the
> extra normalization in Lucene and propose a patch for it.

Try getting it in Lucene.  Anything in Cirrus would be a nasty hack.
Comment 9 paolo anghileri 2014-11-19 18:13:38 UTC
(In reply to Nik Everett from comment #8)

Thanks Nik, I'll try going this way.
As you suggested, I'll post a link to the Lucene patch here soon so you can review it.

Thanks for your suggestions

Paolo
