Last modified: 2014-06-25 14:02:56 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T62299, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 60299 - CirrusSearch: If user searches with a dash in the word then filter to only words with the dash
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nik Everett
Depends on:
Blocks:
Reported: 2014-01-21 18:54 UTC by Nik Everett
Modified: 2014-06-25 14:02 UTC (History)
CC List: 4 users

Description Nik Everett 2014-01-21 18:54:36 UTC
If a user searches an accent-squashing wiki with an accented string, then only return accented results.  Example:
Search for <<clientèle>> should only find pages with <<clientèle>>
Search for <<clientele>> should find page with <<clientele>> and <<clientèle>>

Option: only enable this behaviour when a string is quoted.  Quoting is standard parlance for "please give me an exact match".  We still want quoted unaccented strings to find the accented characters.
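
The matching rule described above can be sketched in Python. This is only an illustration of the intended behaviour, not CirrusSearch or Elasticsearch code: `squash_accents`, `has_accents`, `matches`, and the `only_when_quoted` flag are hypothetical names, and stripping combining marks with `unicodedata` is an assumed stand-in for whatever accent squashing the wiki's analyzer performs.

```python
import unicodedata


def squash_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition, e.g. "clientèle" -> "clientele"."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_accents(text: str) -> bool:
    """True if squashing would remove at least one combining mark."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", text))


def matches(query: str, word: str, quoted: bool = False,
            only_when_quoted: bool = False) -> bool:
    """Hypothetical rule: an accented query matches only the accented form."""
    exact = has_accents(query) and (quoted or not only_when_quoted)
    if exact:
        # Accented query: compare without squashing, so "clientele" is rejected.
        return (unicodedata.normalize("NFC", query).casefold()
                == unicodedata.normalize("NFC", word).casefold())
    # Unaccented (or non-exact) query: squash both sides, so accents are ignored.
    return squash_accents(query).casefold() == squash_accents(word).casefold()
```

Under these assumptions, `matches("clientèle", "clientele")` is false while `matches("clientele", "clientèle")` is true. Setting `only_when_quoted=True` models the option of applying the exact rule only to quoted searches.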
Comment 1 Nik Everett 2014-01-21 19:04:29 UTC
Also, LuceneSearch has special handling for hyphenated words that pretty much does the same thing as I'm proposing for accents.  It looks like it only does it for "exact" tokens.  In CirrusSearch we call those "plain" tokens.  See FastWikiTokenizerEngine.java:332 for more.
Comment 2 Nik Everett 2014-02-07 14:29:09 UTC
I've opened a bug in Elasticsearch for this but it needs to be fixed in their upstream, Lucene, so I've opened a bug there and begun work.
Comment 3 Nemo 2014-04-07 15:33:35 UTC
Both upstream bugs were closed. Is anything stopping this?

(In reply to Nik Everett from comment #0)
> Search for <<clientèle>> should only find pages with <<clientèle>>
> Search for <<clientele>> should find page with <<clientele>> and
> <<clientèle>>

Can the two only be fixed together? The first may not be that important as long as exact matches come first.

On the other hand, the second has been requested repeatedly by several it.wiktionary users.
* Searching "macor" should find "mačor" but it doesn't, one has to search ma*or.
* Searching "tamen" should also find "tāmen", ideally in autocompletion suggestions too.

https://it.wiktionary.org/wiki/Wikizionario:Bar/Archivio/2013-dic#Nuovo_motore_di_ricerca_interno
https://it.wiktionary.org/w/index.php?title=Wikizionario:Bar&diff=1673990&oldid=1673091
Comment 4 Nik Everett 2014-04-07 15:43:20 UTC
(In reply to Nemo from comment #3)
> Both upstream bugs were closed. Is anything stopping this?
> 

Just timing.  I'm upgrading the cluster tomorrow.  I don't like to merge code that won't work on the cluster so I haven't even picked this one up again since working on the upstream bug.  I'll do it, though.

> Can the two only be fixed together? The first may not be that important as
> long as exact matches come first.
> 
> On the other hand, the second has been requested repeatedly by several
> it.wiktionary users.
> * Searching "macor" should find "mačor" but it doesn't, one has to search
> ma*or.
> * Searching "tamen" should also find "tāmen", ideally in autocompletion
> suggestions too.
> 
> https://it.wiktionary.org/wiki/Wikizionario:Bar/Archivio/2013-dic#Nuovo_motore_di_ricerca_interno
> https://it.wiktionary.org/w/index.php?title=Wikizionario:Bar&diff=1673990&oldid=1673091

Only English has any sort of accent squashing at this point.  I can build it for you soon.

The bug is actually about the first.  Right now, in English, if you search for "mačor" you'll get "macor", which frustrates some folks.
Comment 5 Gerrit Notification Bot 2014-04-17 15:02:10 UTC
Change 126995 had a related patch set uploaded by Manybubbles:
Quoted searches with accents only find accented

https://gerrit.wikimedia.org/r/126995
Comment 6 Nik Everett 2014-04-17 15:03:30 UTC
That only covers the accent squashing, not the dashes.  Dashes are harder, I think.  I'll have to think about them some more....
Comment 7 Gerrit Notification Bot 2014-04-17 21:34:13 UTC
Change 126995 merged by jenkins-bot:
Quoted searches with accents only find accented

https://gerrit.wikimedia.org/r/126995
Comment 8 Nik Everett 2014-06-25 14:02:56 UTC
Dash searching is now possible using regexes, which are being deployed now.
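
As a rough illustration of the idea, a regex can restrict results to text that contains the dashed form literally. The sketch below uses Python's `re` module as a stand-in; the regex syntax CirrusSearch exposes is not shown in this bug, and `whole_term_pattern` and `filter_with_dash` are hypothetical helper names.

```python
import re
from typing import List


def whole_term_pattern(term: str) -> "re.Pattern[str]":
    """Regex matching `term` literally, not embedded in a longer word."""
    return re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")


def filter_with_dash(term: str, texts: List[str]) -> List[str]:
    """Keep only the texts that contain the dashed term exactly as typed."""
    pattern = whole_term_pattern(term)
    return [t for t in texts if pattern.search(t)]
```

For example, filtering on "e-mail" keeps a text containing "e-mail" and drops texts that only contain "email" or "e mail".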
