Last modified: 2014-07-07 21:04:17 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T65861, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 63861 - CirrusSearch word segmentation not useful for JS and CSS pages
Status: REOPENED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: master
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nik Everett
Depends on:
Blocks:
Reported: 2014-04-12 16:02 UTC by Helder
Modified: 2014-07-07 21:04 UTC
CC List: 3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Comment 1 Nik Everett 2014-04-14 14:03:46 UTC
This is caused by word segmentation rather than by a problem getting the text into the index.  You can find them if you search like so:
https://test.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=tem.getInitial&fulltext=Search

I'll solve this by improving the setting that Cirrus has called "aggressive splittings" and rolling it out to all documents.  It might cause some unintended results to show up in other places but they should be sorted below the more exact matches.
Comment 2 Helder 2014-04-14 17:27:05 UTC
This search only returns one of the two results. I can get both if I search for "tem.getInitial OR this.getInitial":
https://test.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=tem.getInitial+OR+this.getInitial&fulltext=Search
But that only works if I know the only prefixes will be "tem." and "this.".

I don't know if this will affect the "aggressive splittings" you mention above.
Comment 3 Nik Everett 2014-04-14 18:12:24 UTC
Helder: yeah, that's the splitting.  Without it, "tem.getInitial" and "this.getInitial" are separate terms.  With it, the terms are "tem", "getInitial", and "this".  I'll push a more precise test case now.
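
To illustrate the difference described in this comment, here is a minimal Python sketch. It is not CirrusSearch's actual analysis chain: the functions `whitespace_terms`, `split_terms`, and `matches` are hypothetical stand-ins, the two sample documents are made up, and the bare query "getInitial" is assumed to be the kind of search a user would want to work without knowing the prefix.

```python
import re


def whitespace_terms(text):
    """Split on whitespace only, so a dotted name stays one term."""
    return text.split()


def split_terms(text):
    """Also split on dots and other punctuation (illustrative stand-in
    for the more aggressive splitting discussed in this bug)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def matches(query, document, tokenizer):
    """True if every query term is also a term of the document."""
    doc_terms = set(tokenizer(document))
    return all(term in doc_terms for term in tokenizer(query))


# Hypothetical page contents, one line per indexed page.
docs = ["call tem.getInitial here", "call this.getInitial here"]

for tokenizer in (whitespace_terms, split_terms):
    hits = [d for d in docs if matches("getInitial", d, tokenizer)]
    print(tokenizer.__name__, "finds", len(hits), "page(s) for 'getInitial'")
```

Under these assumptions, the whitespace-only version finds neither page for "getInitial", because the indexed terms are "tem.getInitial" and "this.getInitial", while the splitting version finds both.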
Comment 4 Gerrit Notification Bot 2014-04-14 18:12:34 UTC
Change 125764 had a related patch set uploaded by Manybubbles:
Better test case for word splitting in js

https://gerrit.wikimedia.org/r/125764
Comment 5 Nik Everett 2014-04-14 18:14:05 UTC
Technically this was resolved in https://gerrit.wikimedia.org/r/#/c/125731/ but that extra commit adds a better test case.

It'll require a reindex after deployment, and it only applies to English.  I'm working on other languages, but that is more complicated, unfortunately.
Comment 6 Gerrit Notification Bot 2014-04-15 01:14:42 UTC
Change 125764 merged by jenkins-bot:
Better test case for word splitting in js

https://gerrit.wikimedia.org/r/125764
