Last modified: 2014-02-21 22:45:14 UTC
Make sure every kind of apostrophe counts as a word break. In particular, “L’Oréal”, “L Oréal”, and “L'Oréal” really ought to map to the same terms. Since one of those forms contains a space, the only sane way to do that is to map each of them to two terms.
I know this is right for English, but maybe/probably not other languages.
(In reply to comment #1)
> I know this is right for English, but maybe/probably not other languages.

This is right for French: apostrophes in this language are basically the elision of a vowel and a space.
(In reply to comment #2)
> (In reply to comment #1)
> > I know this is right for English, but maybe/probably not other languages.
>
> This is right for French: apostrophes in this language are basically the
> elision of a vowel and a space.

The new search has a special filter to handle French's elision. Here it is: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-elision-tokenfilter.html. I'll crack open the code and see what it does when I start work on this bug.
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > I know this is right for English, but maybe/probably not other languages.
> >
> > This is right for French: apostrophes in this language are basically the
> > elision of a vowel and a space.
>
> The new search has a special filter to handle French's elision. Here it is:
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-elision-tokenfilter.html
> . I'll crack open the code and see what it does when I start work on this
> bug.

This new filter seems great. (Your link doesn’t mention “d’” as a stop word; it will be worth checking when you dig into the code.)

I’ve done some search tests on frwikisource and it appears that:

- Apostrophes “'” and “’” are indeed interchangeable in the new Elasticsearch: priority is given to the apostrophe typed in the search box, but the other one is returned as well (e.g. the search “l'art d'avoir raison stratagème” first returns a redirect page, but also every occurrence of “L’Art d’avoir toujours raison”). I don’t think this is due to the elision token filter, though: the search “Morestal lorsqu'il” returns the same result as “Morestal lorsqu’il”, even though “lorsqu” is not in this filter.
- Despite this filter, apostrophes in French stop words don’t seem to break words either: the search “avoir toujours raison” doesn’t return “L’Art d’avoir toujours raison”, and the input “art d’avoir toujours raison” returns it, but “Art” is not bolded in the search result.
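For anyone following along, here is a rough Python sketch of what an elision-style filter does to a single token, as I understand it from the discussion above. It is not the Elasticsearch or Lucene code: `strip_elision` and the `ARTICLES` list are hypothetical names and values chosen for illustration, and the real filter's default list may well differ. The list deliberately leaves out “d” and “lorsqu” to mirror the two cases mentioned in this comment.

```python
# Hypothetical article list for illustration only; the real filter's
# defaults may differ. "d" and "lorsqu" are intentionally absent.
ARTICLES = {"l", "m", "t", "qu", "n", "s", "j"}

# Both the typewriter apostrophe (U+0027) and the typographic one (U+2019).
APOSTROPHES = ("'", "\u2019")


def strip_elision(token, articles=ARTICLES):
    """Drop a leading elided article such as "l'" if it is in `articles`.

    Only the first apostrophe in the token is considered. Tokens whose
    prefix is not a listed article are returned unchanged.
    """
    for i, ch in enumerate(token):
        if ch in APOSTROPHES:
            if token[:i].lower() in articles:
                return token[i + 1:]
            return token
    return token


if __name__ == "__main__":
    for tok in ["l'art", "L\u2019Art", "d\u2019avoir", "lorsqu'il"]:
        print(tok, "->", strip_elision(tok))
```

With this list, “l'art” and “L’Art” lose their article while “d’avoir” and “lorsqu'il” pass through untouched, which is the kind of gap that would be worth checking in the real filter's configuration.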
In German we use apostrophes much as in English and French. You can write "what is" as "what's" in English and "ist es" as "ist’s" in German. That's always two words. A special example is "Peter’s Bar". That's actually wrong in German; it must be written as "Peters Bar". However, in both cases the "s" is not part of the name. So the conclusion is the same: two words. In German we prefer U+2019 over every other character. However, people tend to misuse many other characters, including U+0027, U+0060, U+00B4 and others.
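To illustrate the behaviour requested in this bug, here is a minimal Python sketch that treats the apostrophe-like characters named in this thread (U+0027, U+2019, U+0060, U+00B4) as word breaks, the same as whitespace. The function name `terms` and the lowercasing step are assumptions made for the example; this is not how the search backend tokenizes, only a picture of the desired outcome.

```python
import re

# Apostrophe-like characters mentioned in this thread:
# U+0027 ('), U+2019 (right single quotation mark), U+0060 (`), U+00B4 (acute accent).
APOSTROPHE_LIKE = "\u0027\u2019\u0060\u00b4"

# A run of apostrophe-like characters and/or whitespace is one word break.
_BREAK = re.compile("[" + re.escape(APOSTROPHE_LIKE) + r"\s]+")


def terms(text):
    """Split text into lowercased terms, breaking on apostrophes and spaces.

    Lowercasing is an assumption added here so that variants compare equal.
    """
    return [t.lower() for t in _BREAK.split(text) if t]


if __name__ == "__main__":
    for sample in ["L\u2019Or\u00e9al", "L Or\u00e9al", "L'Or\u00e9al", "ist\u2019s"]:
        print(sample, "->", terms(sample))
```

Under these assumptions all three spellings of “L’Oréal” from the description come out as the same two terms, and “ist’s” splits into two words as described for German.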