Last modified: 2014-06-25 14:02:56 UTC
If a user searches an accent-squashing wiki with an accented string, only return accented results. Example:
Search for <<clientèle>> should only find pages with <<clientèle>>
Search for <<clientele>> should find pages with <<clientele>> and <<clientèle>>
Option: only enable this behaviour when a string is quoted. Quoting is standard parlance for "please give me an exact match". We still want quoted unaccented strings to find the accented characters.
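The rule described above can be sketched in a few lines of Python. This is only an illustration of the requested behaviour, not how CirrusSearch or Lucene implement it: the function names `squash_accents`, `has_accents` and `matches` are hypothetical, and "accent squashing" is assumed here to mean stripping Unicode combining marks after NFD decomposition.

```python
import unicodedata


def squash_accents(text: str) -> str:
    """Strip combining marks so that e.g. 'è' becomes 'e' (assumed folding rule)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def has_accents(text: str) -> bool:
    """True if the text contains at least one combining mark once decomposed."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", text))


def matches(query: str, token: str, quoted: bool = False) -> bool:
    """Hypothetical matcher for the proposed behaviour.

    A quoted query that contains accents must match the token exactly.
    Every other query (unquoted, or quoted but unaccented) is compared
    after squashing accents on both sides.
    """
    q = unicodedata.normalize("NFC", query.casefold())
    t = unicodedata.normalize("NFC", token.casefold())
    if quoted and has_accents(q):
        return q == t
    return squash_accents(q) == squash_accents(t)


# Quoted accented query: only the accented form matches.
print(matches("clientèle", "clientèle", quoted=True))
print(matches("clientèle", "clientele", quoted=True))
# Quoted unaccented query: still finds the accented form.
print(matches("clientele", "clientèle", quoted=True))
```

With `quoted=False` the matcher folds both sides, which corresponds to the "Option" above where the stricter behaviour applies only to quoted strings.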
Also, LuceneSearch has special handling for hyphenated words that does pretty much the same thing I'm proposing for accents. It looks like it only does it for "exact" tokens. In CirrusSearch we call those "plain" tokens. See FastWikiTokenizerEngine.java:332 for more.
I've opened a bug in Elasticsearch for this but it needs to be fixed in their upstream, Lucene, so I've opened a bug there and begun work.
Both upstream bugs were closed. Is anything stopping this?
(In reply to Nik Everett from comment #0)
> Search for <<clientèle>> should only find pages with <<clientèle>>
> Search for <<clientele>> should find page with <<clientele>> and
> <<clientèle>>
Can the two only be fixed together? The first may not be that important as long as exact matches come first.
On the other hand, the second has been requested repeatedly by several it.wiktionary users.
* Searching "macor" should find "mačor" but it doesn't, one has to search ma*or.
* Searching "tamen" should also find "tāmen", ideally in autocompletion suggestions too.
https://it.wiktionary.org/wiki/Wikizionario:Bar/Archivio/2013-dic#Nuovo_motore_di_ricerca_interno
https://it.wiktionary.org/w/index.php?title=Wikizionario:Bar&diff=1673990&oldid=1673091
(In reply to Nemo from comment #3)
> Both upstream bugs were closed. Is anything stopping this?
Just timing. I'm upgrading the cluster tomorrow. I don't like to merge code that won't work on the cluster so I haven't even picked this one up again since working on the upstream bug. I'll do it, though.
> Can the two only be fixed together? The first may not be that important as
> long as exact matches come first.
>
> On the other hand, the second has been requested repeatedly by several
> it.wiktionary users.
> * Searching "macor" should find "mačor" but it doesn't, one has to search
> ma*or.
> * Searching "tamen" should also find "tāmen", ideally in autocompletion
> suggestions too.
>
> https://it.wiktionary.org/wiki/Wikizionario:Bar/Archivio/2013-dic#Nuovo_motore_di_ricerca_interno
> https://it.wiktionary.org/w/index.php?title=Wikizionario:Bar&diff=1673990&oldid=1673091
Only English has any sort of accent squashing at this point. I can build it for you soon. The bug is actually about the first. Right now, in English, if you search for "mačor" you'll get "macor" which frustrates some folks.
Change 126995 had a related patch set uploaded by Manybubbles: Quoted searches with accents only find accented https://gerrit.wikimedia.org/r/126995
That only covers the accent squashing, not the dashes. Dashes are harder, I think. I'll have to think about them some more....
Change 126995 merged by jenkins-bot: Quoted searches with accents only find accented https://gerrit.wikimedia.org/r/126995
Dash searching will be handled with regexes, which are being deployed now.
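To illustrate why a regex helps with dashes, here is a rough Python sketch. It does not use CirrusSearch's actual query syntax or analyzers: `tokenize` is a hypothetical stand-in for an analyzer that drops punctuation, and `regex_search` simply runs Python's `re` over raw page text. The page titles and bodies are made up for the example.

```python
import re


def tokenize(text: str) -> list[str]:
    """Hypothetical analyzer stand-in: lowercase and keep only runs of word characters."""
    return re.findall(r"\w+", text.lower())


def regex_search(pattern: str, pages: dict[str, str]) -> list[str]:
    """Return the titles of pages whose raw text matches the regex."""
    compiled = re.compile(pattern)
    return [title for title, body in pages.items() if compiled.search(body)]


# Made-up sample pages that differ only in the dash.
pages = {
    "A": "The co-op opened in 1990.",
    "B": "The co op opened in 1990.",
}

# After tokenization the dash is gone, so both pages look identical.
print(tokenize(pages["A"]) == tokenize(pages["B"]))

# A regex over the raw text can still tell them apart.
print(regex_search(r"co-op", pages))
```

Under these assumptions a token-based search cannot distinguish the two pages, while the regex matches only the page that really contains the dash.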