Last modified: 2014-02-21 22:45:14 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T60701, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 58701 - CirrusSearch: Make sure all sorts of apostrophes count as word breaks
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2013-12-19 19:13 UTC by Nik Everett
Modified: 2014-02-21 22:45 UTC (History)
CC List: 7 users


Description Nik Everett 2013-12-19 19:13:29 UTC
Make sure all sorts of apostrophes count as word breaks.  In particular, “L’Oréal”, “L Oréal”, and “L'Oréal” really ought to map to the same terms.  Since one of these forms contains a space, the only sane way to do that is to map each of them to two terms.
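To make the requested behaviour concrete, here is a minimal Python sketch of a tokenizer that treats apostrophes as word breaks. It is only an illustration, not CirrusSearch or Elasticsearch code: the `tokenize` function, the `WORD_BREAK` pattern, and the choice to lowercase are assumptions made for this example, and only U+0027 and U+2019 are covered.

```python
import re

# Assumption for this sketch: whitespace, the ASCII apostrophe (U+0027) and
# the right single quotation mark (U+2019) all act as word breaks.
WORD_BREAK = re.compile(r"[\s'\u2019]+")


def tokenize(text):
    """Lowercase the text and split it on whitespace and apostrophes."""
    return [term for term in WORD_BREAK.split(text.lower()) if term]


# The three spellings from the description all yield the same two terms.
for query in ["L\u2019Oréal", "L Oréal", "L'Oréal"]:
    print(query, "->", tokenize(query))
```

With this rule each of the three spellings produces the two terms "l" and "oréal", which is the "map them to two terms" outcome described above.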
Comment 1 Nik Everett 2013-12-19 19:49:19 UTC
I know this is right for English, but maybe/probably not other languages.
Comment 2 François Martin 2013-12-19 22:26:01 UTC
(In reply to comment #1)
> I know this is right for English, but maybe/probably not other languages.

This is right for French: apostrophes in this language are basically the elision of a vowel and a space.
Comment 3 Nik Everett 2013-12-20 01:38:48 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > I know this is right for English, but maybe/probably not other languages.
> 
> This is right for French: apostrophes in this language are basically the
> elision of a vowel and a space.

The new search has a special filter to handle French's elision.  Here it is:  http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-elision-tokenfilter.html .  I'll crack open the code and see what it does when I start work on this bug.
Comment 4 François Martin 2013-12-20 10:01:38 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > I know this is right for English, but maybe/probably not other languages.
> > 
> > This is right for French: apostrophes in this language are basically the
> > elision of a vowel and a space.
> 
> The new search has a special filter to handle French's elision.  Here it is:
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-elision-tokenfilter.html .
> I'll crack open the code and see what it does when I start work on this bug.

This new filter seems great. (Your link doesn’t mention “d’” as a stop word; it will be worth checking when you dig into the code.)
I’ve done some search tests on frwikisource and it appears that:

- The apostrophes “'” and “’” are indeed interchangeable in the new Elasticsearch-based search: priority is given to the apostrophe typed in the search box, but the other one is returned as well (e.g. the search “l'art d'avoir raison stratagème” first returns a redirect page, but also every occurrence of “L’Art d’avoir toujours raison”). I don’t think this is due to the elision token filter, though: the search “Morestal lorsqu'il” returns the same results as “Morestal lorsqu’il”, even though “lorsqu” is not in this filter.

- Despite this filter, apostrophes after French stop words don’t seem to break words: the search “avoir toujours raison” doesn’t return “L’Art d’avoir toujours raison”, and the search “art d’avoir toujours raison” returns it, but “Art” is not bolded in the search result.
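For readers unfamiliar with elision filtering, the following Python sketch simulates the general idea: a leading article plus apostrophe is dropped from a token when the article is in a configured list. It is not the Elasticsearch or Lucene implementation. The `strip_elision` helper and the `ARTICLES` set are hypothetical, and the list contents are an assumption chosen so that "d" and "lorsqu" are absent, as in the observations above.

```python
# Hypothetical article list for illustration only; "d" and "lorsqu" are
# deliberately absent, mirroring the observations in this comment.
ARTICLES = {"l", "m", "t", "qu", "n", "s", "j"}
APOSTROPHES = ("'", "\u2019")


def strip_elision(token, articles=ARTICLES):
    """Drop a leading elided article such as "l'" if it is in `articles`."""
    for apostrophe in APOSTROPHES:
        prefix, sep, rest = token.partition(apostrophe)
        if sep and rest and prefix.lower() in articles:
            return rest
    # No listed article in front of an apostrophe: leave the token as it is.
    return token


for token in ["L\u2019Art", "d\u2019avoir", "lorsqu'il"]:
    print(token, "->", strip_elision(token))

# Adding "d" to the list changes how "d'avoir" is handled.
print(strip_elision("d\u2019avoir", ARTICLES | {"d"}))
```

Under these assumptions “L’Art” is reduced to “Art”, while “d’avoir” and “lorsqu'il” stay whole unless their prefixes are added to the list. That would be one possible explanation for a search on “avoir” missing “d’avoir”, but it has not been checked against the actual filter.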
Comment 5 TMg 2014-01-17 21:38:36 UTC
In German we use apostrophes much as in English and French. You can write "what is" as "what's" in English and "ist es" as "ist’s" in German. That's always two words.

A special example is "Peter’s Bar". That's actually wrong in German. It must be written as "Peters Bar". However, in both cases the "s" is not part of the name. So the conclusion is the same: two words.

In German we prefer U+2019 over every other character. However, people tend to misuse many other characters including U+0027, U+0060, U+00B4 and others.
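The code points listed above could be folded into a single apostrophe before any word breaking is applied. The Python sketch below shows one way to do that with `str.translate`; the `normalize_apostrophes` helper and the `APOSTROPHE_LIKE` table are names invented for this example, and the table covers only the characters named in this comment, so it is incomplete by design.

```python
# Code points named in the comment above, mapped to the ASCII apostrophe
# (U+0027). The comment also says "and others", so this table is incomplete.
APOSTROPHE_LIKE = {
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK (preferred in German)
    0x0060: "'",  # GRAVE ACCENT
    0x00B4: "'",  # ACUTE ACCENT
}


def normalize_apostrophes(text):
    """Replace apostrophe-like characters with the ASCII apostrophe."""
    return text.translate(APOSTROPHE_LIKE)


for sample in ["ist\u2019s", "Peter\u00b4s Bar", "Peter`s Bar", "what's"]:
    print(sample, "->", normalize_apostrophes(sample))
```

Normalizing first means that a later word-break rule only has to know about one apostrophe character.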
