Last modified: 2014-10-28 16:20:39 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T72873, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 70873 - Cirrus unable to find insource:"mazovia.pl" on pl.wp where the phrase occurs in a URL
Cirrus unable to find insource:"mazovia.pl" on pl.wp where the phrase occurs ...
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: master
Hardware: All
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2014-09-15 23:59 UTC by Bartosz Dziewoński
Modified: 2014-10-28 16:20 UTC
CC: 4 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description Bartosz Dziewoński 2014-09-15 23:59:43 UTC
Cirrus is unable to find insource:"mazovia.pl" on pl.wp where the phrase occurs in a URL. insource:/mazovia\.pl/, however, does find it, but takes ages to run. I can't tell whether this is a bug or somehow expected behavior.

For example, https://pl.wikipedia.org/w/index.php?title=Specjalna%3ASzukaj&profile=default&search=insource%3A%22mazovia.pl%22&fulltext=Search should find https://pl.wikipedia.org/wiki/Elżbieta_Lanc , but doesn't.
Comment 1 Nik Everett 2014-09-16 13:52:32 UTC
It's not right, but it's not unexpected.  insource:"" segments words in the same way that we segment regular text.  I can't think of a workaround for you at this point either - insource:// is going to be slow unless you pair it with another filter like insource:"", but insource:"" is exactly the part that isn't matching for you.

Thinking out loud for a solution: I wonder if it's safe to trick the language analyzer by pretending that ".", ":" and "/" are " ".  That'll cause splits where we want them, I think.  I'm not sure that's right for all text in all languages, but maybe?
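
A minimal, self-contained sketch (Python, not the actual CirrusSearch/Elasticsearch analyzer) of the idea above: map ".", ":" and "/" to spaces before splitting on whitespace, analyze the phrase query the same way, and check whether the query tokens occur as a contiguous run in the document tokens. The sample text and helper names are made up for illustration.

def tokenize(text, split_url_chars=False):
    """Split text into lowercase whitespace-separated tokens.

    With split_url_chars=True, ".", ":" and "/" are first mapped to spaces,
    mimicking a character filter that runs before the tokenizer.
    """
    if split_url_chars:
        text = text.translate(str.maketrans({".": " ", ":": " ", "/": " "}))
    return text.lower().split()

def phrase_matches(doc_tokens, phrase_tokens):
    """True if phrase_tokens appear as a contiguous run inside doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

text = "Oficjalna strona: http://www.mazovia.pl/portal/"

# The phrase query insource:"mazovia.pl" is analyzed the same way as the page text.
for split in (False, True):
    doc = tokenize(text, split_url_chars=split)
    query = tokenize("mazovia.pl", split_url_chars=split)
    print(split, doc, phrase_matches(doc, query))

# split_url_chars=False: the whole URL stays one token, so the phrase never matches.
# split_url_chars=True:  the document yields ... "www", "mazovia", "pl", "portal" ...
# and the two-token phrase ["mazovia", "pl"] lines up against it.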
Comment 2 Bartosz Dziewoński 2014-09-16 15:28:46 UTC
Oh, so URLs are one "segment", and this doesn't find "substrings"? That makes sense.

Splitting on these characters sounds reasonable to me. There are some cases like "AC/DC", but that shouldn't cause any problems, right?
Comment 3 Nik Everett 2014-09-16 15:50:11 UTC
(In reply to Bartosz Dziewoński from comment #2)
> Oh, so URLs are one "segment", and this doesn't find "substrings"? That
> makes sense.
> 
> Splitting on these characters sounds reasonable to me. There are some cases
> like "AC/DC", but that shouldn't cause any problems, right?

You've got it.  The way search works is that all the words are segmented (tokenized), then normalized, then indexed for quick lookup.  The trick is that each language is subtly different, and I only speak English, so I can only validate that the choices make sense there.  And it's hard to propose changes that cross many languages.

Anyway, I'll see if I can make a tool to easily look at how words are segmented in your language.  And I'll see if I can make it easy to experiment a bit with stuff.
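
For reference, Elasticsearch (which CirrusSearch runs on) exposes an _analyze API that reports how a given analyzer segments text, so an inspection tool could be a thin wrapper around it. The sketch below is hedged: it assumes an Elasticsearch instance reachable at localhost:9200 and uses the JSON-body form of _analyze found in newer versions; older releases take the analyzer and text as URL parameters instead.

import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local test instance

def analyze(text, analyzer="standard"):
    """Return the tokens the named analyzer produces for the given text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    req = urllib.request.Request(
        ES_URL + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    return [t["token"] for t in result["tokens"]]

if __name__ == "__main__":
    print(analyze("Oficjalna strona: http://www.mazovia.pl/portal/"))
    # If the URL (or "www.mazovia.pl") comes back as a single token, a phrase
    # query for "mazovia.pl" has nothing to line up against.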
Comment 4 Nik Everett 2014-10-28 16:20:04 UTC
Works now!
Comment 5 Nik Everett 2014-10-28 16:20:39 UTC
This was probably fixed by my change to the analyzer to treat "." like a space.  It wasn't fixed magically - it came along as part of the solution to another bug.


