Last modified: 2014-09-20 06:27:27 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T72950, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 70950 - CirrusSearch should provide a way to find hyphenated words, as Lucene-search always has
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: High enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-09-17 18:46 UTC by SpontaneousGrumbler
Modified: 2014-09-20 06:27 UTC (History)

Description SpontaneousGrumbler 2014-09-17 18:46:04 UTC
Before CirrusSearch can be considered a replacement for Lucene-search (not a downgrade), it needs to be able to find hyphenated words, such as "he was assigned to follow-up on the discovery", without finding "follow up". Lucene-search finds both the hyphenated and unhyphenated forms if "follow up" is searched; it finds only the hyphenated form if "follow-up" is searched. This allows editors to find and fix cases of improper punctuation. This change would allow CirrusSearch to match what Lucene-search does now. Even nicer would be to provide a way to find an actual space and not hyphenation, such as "He was well-known in Europe."
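
For illustration only, the matching behaviour requested in this description can be modelled as a small Python function. This is a toy sketch of the rule as stated above, not lsearchd's or CirrusSearch's actual code; the name `toy_match` is hypothetical, and matching is assumed to be case-insensitive and limited to whole words.

```python
import re


def toy_match(query: str, text: str) -> bool:
    """Toy model of the requested behaviour (hypothetical helper)."""
    q = query.lower()
    t = text.lower()
    if "-" not in q:
        # Unhyphenated query: treat hyphens in the text as spaces, so
        # "follow up" matches both "follow up" and "follow-up".
        t = t.replace("-", " ")
    # Hyphenated query: the text is left as written, so only the
    # hyphenated form can match.
    pattern = r"(?<!\w)" + re.escape(q) + r"(?!\w)"
    return re.search(pattern, t) is not None


if __name__ == "__main__":
    sample = "he was assigned to follow-up on the discovery"
    print(toy_match("follow up", sample))
    print(toy_match("follow-up", "a follow up visit"))
```

Under this model, a search for "follow-up" skips text that contains only "follow up", which is the case editors need in order to track down improper hyphenation.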
Comment 1 Nemo 2014-09-19 11:02:15 UTC
(In reply to SpontaneousGrumbler from comment #0)
> find hyphenated words, such as "he was
> assigned to follow-up on the discovery", without finding "follow up".

It's not about hyphens specifically, tweaked summary. Currently, you can use "insource:".

This has been discussed at <https://www.mediawiki.org/wiki/Thread:Help_talk:CirrusSearch/%22Really%22_exact_matches> and <https://www.mediawiki.org/wiki/Help:CirrusSearch#insource:> is now a bit clearer (while <https://www.mediawiki.org/wiki/Help:CirrusSearch#Quotes_and_exact_matches> is probably a bit confusing).

Certainly "search the pre-tokenized version of the source" is not particularly clear...
Comment 2 SpontaneousGrumbler 2014-09-19 17:22:47 UTC
(In reply to Nemo from comment #1)
> (In reply to SpontaneousGrumbler from comment #0)
> > find hyphenated words, such as "he was
> > assigned to follow-up on the discovery", without finding "follow up".
> 
> It's not about hyphens specifically, tweaked summary. Currently, you can use
> "insource:".
> 
> This has been discussed at
> <https://www.mediawiki.org/wiki/Thread:Help_talk:CirrusSearch/
> %22Really%22_exact_matches> and
> <https://www.mediawiki.org/wiki/Help:CirrusSearch#insource:> is now a bit
> clearer (while
> <https://www.mediawiki.org/wiki/Help:CirrusSearch#Quotes_and_exact_matches>
> is probably a bit confusing).
> 
> Certainly "search the pre-tokenized version of the source" is not
> particularly clear...

Anyone else have vertigo after following the discussion from the CirrusSearch help page, where this was first brought up, then to the Bugzilla report, then from there back to the CirrusSearch help page as a proposed solution? The insource: feature is no help at all for this problem. The regex flavor runs for a long time and then falls off the edge of the earth. The other flavor doesn't pay any more attention to hyphens than the straightforward search. Let's stop dodging the issue and get to work fixing the problem. How do I get the summary changed back to "CirrusSearch should provide a way to find hyphenated words"? The updated summary about "exact match" seems to be a setup for deflecting this back to some pie-in-the-sky solution using the insource: feature.
Comment 3 Nemo 2014-09-19 17:43:35 UTC
(In reply to SpontaneousGrumbler from comment #2)
> The updated summary about "exact
> match" seems to be a setup for deflecting this back to some pie-in-the-sky
> solution using the insource: feature.

Haha, well put, maybe you're right: but I think not. I changed the summary to cover the other users' scenario as well because (long story short) I think the ElasticSearch "feature" doing this is the same.
Comment 4 Nik Everett 2014-09-19 17:50:22 UTC
I think he's right in that hyphenated words are the only thing that lsearchd has special handling for.  There could be more - the code is vast and I haven't read it all - but I don't think there are.  I've set the summary back to how SpontaneousGrumbler@gmail.com originally filed it.  Are there any constructs other than hyphenated words that have this problem?

The problem with adding lsearchd's support for hyphenated words to Cirrus is that it relies on some pretty gnarly hacks that we can't easily replicate.  My hope was that regexes would give you more power to find more things and that they'd be tolerably fast.

At this point I'm not willing to reimplement the hyphenation hack - it's just too much work and it only handles the hyphens.  I'm very happy to work to make the regex search faster.  Adding another clause (<<insource:"follow-up" insource:/follow-up/>> for example) speeds it up but if there are other regex searches in front of you (there is a queue that all users share) it gets slow again.  I can certainly work on that.
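
As a rough sketch of the two-clause workaround mentioned in this comment, the following Python builds the combined query string and a search URL for it. The helper names and the default base URL are assumptions made for illustration. The phrase is restricted to letters, digits, spaces and hyphens so that no quoting or regex-escaping rules have to be guessed.

```python
import re
from urllib.parse import urlencode


def combined_insource_query(phrase: str) -> str:
    """Pair a quoted insource: clause with an insource: regex clause."""
    # Only accept characters that need no escaping in either clause.
    if not re.fullmatch(r"[A-Za-z0-9 -]+", phrase):
        raise ValueError("phrase contains characters this sketch does not escape")
    return f'insource:"{phrase}" insource:/{phrase}/'


def search_url(query: str, base: str = "https://en.wikipedia.org/w/index.php") -> str:
    """Build a search URL; the default base is an assumed wiki address."""
    return base + "?" + urlencode({"search": query})


if __name__ == "__main__":
    q = combined_insource_query("follow-up")
    print(q)
    print(search_url(q))
```

The quoted clause is there to narrow the set of pages before the regex clause runs, which is the speed-up described above.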

Even when Cirrus is the primary search backend for enwiki you'll still be able to use lsearchd for a few months with a url parameter (&srbackend=LuceneSearch) and we'll monitor which queries still hit that system before we disable it entirely.  We're in no hurry there.

As to the discussion being in three places - I'm not sure what to say.  I have trouble keeping track of anything outside of bugzilla.
Comment 5 Chad H. 2014-09-19 18:14:32 UTC
(In reply to Nik Everett from comment #4)
> I think he's right in that hyphenated words are the only thing that lsearchd
> has special handling for.  There could be more - the code is vast and I
> haven't read it all - but I don't think there are.  I've set the summary
> back to how SpontaneousGrumbler@gmail.com originally filed it.  Are there
> any constructs other than hyphenated words that have this problem?
> 

In English I can't think of any, but I'd really like to look further into what lsearchd is doing here. I don't think the original request is unreasonable, although I agree that it's not the most straightforward thing for us to implement.

> I'm very happy to work to
> make the regex search faster.  Adding another clause (<<insource:"follow-up"
> insource:/follow-up/>> for example) speeds it up but if there are other
> regex searches in front of you (there is a queue that all users share) it
> gets slow again.  I can certainly work on that.
> 

We can always improve insource :)
Comment 6 Mikhail Ryazanov 2014-09-20 06:27:27 UTC
(In reply to Chad H. from comment #5)
> In English I can't think of any, but I'd really like to look further into
> what lsearchd is doing here. I don't think the original request is
> unreasonable, although I agree that it's not the most straightforward thing
> for us to implement.

English is not the only language in the world. ;–) But even for it, for example, capitalization is another important "exact" thing.

Other things from my experience: some strange people might write, for example, "km\h" instead of "km/h"; sometimes hyphens and dashes are confused in compound words; it might be useful to distinguish between phrases (with spaces), URLs (with dots) or emails and some fancy names (such as "Folding@home").

I don't think that it is very difficult to add a post-filter to the current "exact search" that will check for "truly exact" (character-wise) matches. It shouldn't be difficult to add some modifiers (in the spirit of current "~") to trigger this behavior.
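
A minimal sketch of the post-filter idea, assuming the regular search has already returned candidate texts: keep only those that contain the phrase character for character. The function name and the sample data are hypothetical, and this says nothing about how such a filter would be wired into CirrusSearch itself.

```python
from typing import Iterable, Iterator


def exact_post_filter(candidates: Iterable[str], phrase: str) -> Iterator[str]:
    """Yield only candidates containing `phrase` exactly as typed."""
    for text in candidates:
        # Plain substring test: sensitive to case, punctuation and spacing.
        if phrase in text:
            yield text


if __name__ == "__main__":
    hits = ["speed in km/h", "speed in km\\h", "Folding@home", "folding@home"]
    print(list(exact_post_filter(hits, "km/h")))
    print(list(exact_post_filter(hits, "Folding@home")))
```

Because the filter only narrows results that the existing search has already found, its cost grows with the number of candidates rather than with the size of the index.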
