Last modified: 2014-08-18 14:47:44 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and, except for displaying bug reports and their history, links might be broken. See T71226, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 69226 - Phrase matching with stemming in CirrusSearch
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Normal normal
Assigned To: Nobody
Reported: 2014-08-07 11:02 UTC by Edward Betts
Modified: 2014-08-18 14:47 UTC (History)

Description Edward Betts 2014-08-07 11:02:55 UTC
Here is my example query: "station box" AND Helsinki

If I try this search on English Wikipedia I get 'Helsinki Metro' as a result with LuceneSearch, but no results with CirrusSearch.

The wiki text contains this: "Two [[station box]]es were constructed in Hakaniemi."

Stemming in phrase searches in LuceneSearch was a bug, but now I have code that depends on this bug.

I found that Bug 54020 requested this change: disabling stemming in phrase matches.

It would be useful if it were possible to use CirrusSearch to search for terms next to each other, like a phrase search, but with stemming. The syntax doesn't need to be the same as LuceneSearch. Stemming in phrase matching could be a tick box in advanced search and/or an extra parameter in the search API.
Comment 1 Nik Everett 2014-08-07 11:42:09 UTC
Already done:  <<"station box"~ helsinki>>.

There is documentation for this but it's kind of buried: https://www.mediawiki.org/wiki/Search/CirrusSearchFeatures#Quotes_and_exact_matches
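
For code that calls the search API, the stemmed phrase syntax can be passed as an ordinary search string. The following is only a sketch: it assumes the standard MediaWiki `action=query&list=search` module with its `srsearch` parameter, and the helper name `build_search_url` is hypothetical. It builds the URL without sending a request.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# English Wikipedia API endpoint (assumed target wiki for this example).
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_url(phrase, extra_terms="", stemmed=True):
    """Build a search API URL for a quoted phrase plus optional extra terms.

    When `stemmed` is true, a trailing ~ is appended to the quoted phrase,
    which is the CirrusSearch syntax for a stemmed phrase match.
    """
    quoted = '"%s"' % phrase
    if stemmed:
        quoted += "~"
    query = (quoted + " " + extra_terms).strip()
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_search_url("station box", "helsinki")
# Decode the URL again to show the search string the server would receive.
print(parse_qs(urlparse(url).query)["srsearch"][0])
```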
Comment 2 Edward Betts 2014-08-07 12:40:08 UTC
A search for "station box" on LuceneSearch gives 42 results. Searching for "station box"~ on CirrusSearch gives 327 results, so the new search is matching many more pages.

An example from the first 20 CirrusSearch results, [[Ormside railway station]], doesn't contain any occurrence of the term 'station' followed by the term 'box'.
Comment 3 Nik Everett 2014-08-07 20:24:00 UTC
That looks like a phrase slop error.  The default slop should be 0 but is 1 in this case.
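
To illustrate why a slop of 1 returns more pages than a slop of 0, here is a simplified model of a stemmed two-term phrase match. It is not how CirrusSearch or Lucene implement it: `toy_stem` is a crude stand-in for a real stemmer, only in-order matches are considered, and the second example sentence is invented.

```python
import re


def toy_stem(word):
    """Very rough plural stripping; a stand-in for a real stemmer."""
    w = word.lower()
    if w.endswith(("xes", "ses", "ches", "shes")):
        return w[:-2]
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def tokens(text, stem):
    """Split text into lowercase word tokens, optionally stemmed."""
    words = re.findall(r"[A-Za-z]+", text)
    return [toy_stem(w) if stem else w.lower() for w in words]


def phrase_match(text, phrase, slop=0, stem=False):
    """Return True if the two-term phrase occurs in order in the text,
    with at most `slop` other tokens between the two terms."""
    doc = tokens(text, stem)
    first, second = tokens(phrase, stem)
    for i, tok in enumerate(doc):
        if tok != first:
            continue
        # The second term may sit at i+1 (slop 0) up to i+1+slop.
        for j in range(i + 1, min(i + 2 + slop, len(doc))):
            if doc[j] == second:
                return True
    return False


wiki_text = "Two station boxes were constructed in Hakaniemi."
other_text = "The station signal box was demolished."  # invented example

print(phrase_match(wiki_text, "station box"))                       # exact, unstemmed
print(phrase_match(wiki_text, "station box", stem=True))            # stemmed, slop 0
print(phrase_match(other_text, "station box", slop=0, stem=True))   # stemmed, slop 0
print(phrase_match(other_text, "station box", slop=1, stem=True))   # stemmed, slop 1
```

In this model only the last call matches the invented sentence, which is the kind of extra result a default slop of 1 lets through.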
Comment 4 Nik Everett 2014-08-13 21:10:47 UTC
For the most part this is caused by the phrase slop issue I mentioned earlier.  The temporary workaround is to search for <"station box"~0~>.  That uses 0 slop and stemmed matching.  I'm switching the default slop to 0 for the stemmed matching so you won't have to do this in a few weeks once it's merged and deployed.

Another issue that is causing extra results is a thing called "position offset gap".  For fields in the search that are multivalued, a search for "station box" can currently find matches _across_ those multiple values.  I found that issue while working on something else a few weeks ago and the fix is being applied literally right now.  It requires an index rebuild so give it 24 hours.
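
The effect of the position offset gap can be sketched with a toy positional index. This is an illustration under assumptions, not the actual index code: the field values are invented, and `index_positions` and `adjacent` are hypothetical helpers. With a gap of 0 the last token of one value sits directly before the first token of the next, so a phrase can match across values; a larger gap pushes them apart.

```python
def index_positions(values, gap):
    """Assign token positions across all values of a multivalued field.

    `gap` is the number of empty positions inserted between consecutive
    values (the position offset gap).
    """
    positions = {}
    pos = 0
    for value in values:
        for tok in value.lower().split():
            positions.setdefault(tok, []).append(pos)
            pos += 1
        pos += gap
    return positions


def adjacent(positions, first, second):
    """True if `second` appears at the position right after `first`."""
    second_positions = positions.get(second, [])
    return any(p + 1 in second_positions for p in positions.get(first, []))


# Two separate values of one field (invented for this example).
field_values = ["Hakaniemi station", "box girder bridge"]

print(adjacent(index_positions(field_values, gap=0), "station", "box"))
print(adjacent(index_positions(field_values, gap=10), "station", "box"))
```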
Comment 5 Gerrit Notification Bot 2014-08-13 21:14:37 UTC
Change 153943 had a related patch set uploaded by Manybubbles:
Switch default phrase slop to 0

https://gerrit.wikimedia.org/r/153943
Comment 6 Gerrit Notification Bot 2014-08-18 14:42:41 UTC
Change 153943 merged by jenkins-bot:
Switch default phrase slop to 0

https://gerrit.wikimedia.org/r/153943
