Last modified: 2014-10-28 16:20:39 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T72873, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 70873 - Cirrus unable to find insource:"mazovia.pl" on pl.wp where the phrase occurs in a URL
Cirrus unable to find insource:"mazovia.pl" on pl.wp where the phrase occurs ...
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: master
Hardware: All
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2014-09-15 23:59 UTC by Bartosz Dziewoński
Modified: 2014-10-28 16:20 UTC
CC: 4 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description Bartosz Dziewoński 2014-09-15 23:59:43 UTC
Cirrus is unable to find insource:"mazovia.pl" on pl.wp where the phrase occurs in a URL. insource:/mazovia\.pl/, however, does find it, but takes ages to run. I can't tell whether this is a bug or somehow expected behavior.

For example, https://pl.wikipedia.org/w/index.php?title=Specjalna%3ASzukaj&profile=default&search=insource%3A%22mazovia.pl%22&fulltext=Search should find https://pl.wikipedia.org/wiki/Elżbieta_Lanc , but doesn't.
Comment 1 Nik Everett 2014-09-16 13:52:32 UTC
It's not right, but it's not unexpected.  insource:"" segments words in the same way that we segment regular text.  I can't think of a workaround for you at this point either - insource:// is going to be slow unless you pair it with another filter like insource:"", but insource:"" is exactly the part that isn't matching for you.

Thinking out loud for a solution: I wonder if it's safe to trick the language analyzer by pretending that ".", ":" and "/" are " ".  That'll cause splits where we want them, I think.  I'm not sure that's right for all text in all languages, but maybe?
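
A minimal, self-contained sketch (Python, not the actual CirrusSearch/Elasticsearch analyzer) of the idea above: map ".", ":" and "/" to spaces before splitting on whitespace, analyze the phrase query the same way, and check whether the query tokens occur as a contiguous run in the document tokens. The sample text and helper names are made up for illustration.

def tokenize(text, split_url_chars=False):
    """Split text into lowercase whitespace-separated tokens.

    With split_url_chars=True, ".", ":" and "/" are first mapped to spaces,
    mimicking a character filter that runs before the tokenizer.
    """
    if split_url_chars:
        text = text.translate(str.maketrans({".": " ", ":": " ", "/": " "}))
    return text.lower().split()

def phrase_matches(doc_tokens, phrase_tokens):
    """True if phrase_tokens appear as a contiguous run inside doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

text = "Oficjalna strona: http://www.mazovia.pl/portal/"

# The phrase query insource:"mazovia.pl" is analyzed the same way as the page text.
for split in (False, True):
    doc = tokenize(text, split_url_chars=split)
    query = tokenize("mazovia.pl", split_url_chars=split)
    print(split, doc, phrase_matches(doc, query))

# split_url_chars=False: the whole URL stays one token, so the phrase never matches.
# split_url_chars=True:  the document yields ... "www", "mazovia", "pl", "portal" ...
# and the two-token phrase ["mazovia", "pl"] lines up against it.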
Comment 2 Bartosz Dziewoński 2014-09-16 15:28:46 UTC
Oh, so URLs are one "segment", and this doesn't find "substrings"? That makes sense.

Splitting on these characters sounds reasonable to me. There are some cases like "AC/DC", but that shouldn't cause any problems, right?
Comment 3 Nik Everett 2014-09-16 15:50:11 UTC
(In reply to Bartosz Dziewoński from comment #2)
> Oh, so URLs are one "segment", and this doesn't find "substrings"? That
> makes sense.
> 
> Splitting on these characters sounds reasonable to me. There are some cases
> like "AC/DC", but that shouldn't cause any problems, right?

You've got it.  The way search works is that all the words are segmented (tokenized), then normalized, then indexed for quick lookup.  The trick is that each language is subtly different, and I only speak English, so I can only validate that the choices make sense there.  And it's hard to propose changes that cross many languages.

Anyway, I'll see if I can make a tool to easily look at how words are segmented in your language.  And I'll see if I can make it easy to experiment a bit with stuff.
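
For reference, Elasticsearch (which CirrusSearch runs on) exposes an _analyze API that reports how a given analyzer segments text, so an inspection tool could be a thin wrapper around it. The sketch below is hedged: it assumes an Elasticsearch instance reachable at localhost:9200 and uses the JSON-body form of _analyze found in newer versions; older releases take the analyzer and text as URL parameters instead.

import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local test instance

def analyze(text, analyzer="standard"):
    """Return the tokens the named analyzer produces for the given text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    req = urllib.request.Request(
        ES_URL + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    return [t["token"] for t in result["tokens"]]

if __name__ == "__main__":
    print(analyze("Oficjalna strona: http://www.mazovia.pl/portal/"))
    # If the URL (or "www.mazovia.pl") comes back as a single token, a phrase
    # query for "mazovia.pl" has nothing to line up against.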
Comment 4 Nik Everett 2014-10-28 16:20:04 UTC
Works now!
Comment 5 Nik Everett 2014-10-28 16:20:39 UTC
This was probably fixed by my change to the analyzer to treat "." like a space.  It wasn't fixed magically - it came along as part of the solution to another bug.


