Last modified: 2013-09-30 19:57:24 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from pages displaying bug reports and their history, links might be broken. See T54948, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 52948 - Search does not return files with search terms in metadata or filename
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nik Everett
Whiteboard: Elasticsearch0.90.4
Keywords: upstream
Reported: 2013-08-16 23:47 UTC by Sumana Harihareswara
Modified: 2013-09-30 19:57 UTC
CC List: 3 users

Description Sumana Harihareswara 2013-08-16 23:47:30 UTC
I search for "savepage" and "screenshot" and "savepage-greyed" on test2, e.g., https://test2.wikipedia.org/w/index.php?title=Special:Search&search=screenshot&fulltext=Search&profile=all&redirs=0 , and do not find https://test2.wikipedia.org/wiki/File:Savepage-greyed.png in the results, even though that file has the word "Screenshot" in its file summary.
Comment 1 Bawolff (Brian Wolff) 2013-08-16 23:52:59 UTC
(In reply to comment #0)
> I search for "savepage" and "screenshot" and "savepage-greyed" on test2, e.g.,
> https://test2.wikipedia.org/w/index.php?title=Special:Search&search=screenshot&fulltext=Search&profile=all&redirs=0
> , and do not find https://test2.wikipedia.org/wiki/File:Savepage-greyed.png in
> the results, even though that file has the word "Screenshot" in its file
> summary.

I would be more concerned about it not picking up "Screenshot" in the body of the image description page than about it not picking the word out of the img_comment.
Comment 2 Nik Everett 2013-08-19 15:38:37 UTC
Triaging to high.  Weird.
Comment 3 Nik Everett 2013-08-20 15:15:36 UTC
The following searches seem to find it just fine:
https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=Savepage-greyed.png&fulltext=Search&srbackend=CirrusSearch

https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=Savepage&fulltext=Search

https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=greyed.png&fulltext=Search

https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=screenshot&fulltext=Search

But this search didn't find the file and probably should:
https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=greyed&fulltext=Search

It seems not to be working because Savepage-greyed.png is tokenized as "savepag" and "greyed.png" [1], which isn't really what we want.  I'm not sure what we do want, though.  Maybe "savepag" and "grey" and ".png".


[1] Running http://<elasticsearch_host>:9200/nikwiki_general/_analyze?analyzer=text&text=Savepage-greyed.png spits out
{
  "tokens": [
    {
      "token": "savepag",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "greyed.png",
      "start_offset": 9,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
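
The miss on "greyed" follows from the token list above: a full-text match compares query tokens with indexed tokens, and "greyed" is not equal to "greyed.png". Below is a minimal Python sketch of that behaviour. It assumes a simplified tokenizer that splits on hyphens and whitespace but keeps a dot between letters or digits inside one token (roughly what the output above shows). Stemming is ignored, so it produces "savepage" rather than "savepag", and the `tokenize` and `matches` helpers are hypothetical, not CirrusSearch or Elasticsearch code.

```python
import re

# Assumption: a token is a run of letters/digits, optionally joined by single
# dots, mimicking the "greyed.png" token in the _analyze output above.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and return its tokens (no stemming)."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def matches(query, indexed_text):
    """Return True if every query token equals some indexed token."""
    indexed = set(tokenize(indexed_text))
    return all(token in indexed for token in tokenize(query))


title = "Savepage-greyed.png"
print(tokenize(title))               # ['savepage', 'greyed.png']
print(matches("greyed.png", title))  # True
print(matches("greyed", title))      # False
```

Under these assumptions, "greyed.png" and "Savepage" match the title while "greyed" alone does not, which mirrors the search results listed in this comment.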
Comment 4 Nik Everett 2013-08-20 15:20:09 UTC
By the way, I'm not sure why it wasn't working when you reported the problem but is working now.  I'm going to add a few more regression tests to make sure that the searches that do work continue to work; then I might merge the tokenizing problem into Bug 53013 and continue to work through the remaining bugs.
Comment 5 Nik Everett 2013-08-20 18:41:59 UTC
I've added some file search regression tests in https://gerrit.wikimedia.org/r/#/c/80074/ .  Now that most searches seem to be working and we've got regression tests for them I'm going to lower this to normal priority and work on the stemming problem I mentioned in Comment 3 when I've knocked out the higher priority bugs.
Comment 6 Nik Everett 2013-09-04 13:51:22 UTC
I want to fix this with a Pattern Capture token filter, but that isn't in the version of Elasticsearch we're using (0.90.2).  It _is_ in 0.90.3, but since 0.90.4 is supposed to be coming out "early next week" and we've got a bunch of bugs waiting on that, I'm tagging this as waiting on that too.  With this filter, fixing the last portion of this bug should be pretty simple.
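
For readers unfamiliar with the filter, here is a hedged sketch of how a `pattern_capture` token filter might be declared in index settings, written as a Python dict so it can be serialized to JSON. The filter name `dot_parts`, the analyzer name `text_with_parts`, the analyzer's composition, and the pattern are all illustrative assumptions; this is not the CirrusSearch configuration.

```python
import json

# Hypothetical index settings: the names "dot_parts" and "text_with_parts"
# and the pattern are illustrative assumptions, not CirrusSearch's config.
settings = {
    "analysis": {
        "filter": {
            "dot_parts": {
                "type": "pattern_capture",
                "preserve_original": True,
                "patterns": ["([^.]+)"],
            }
        },
        "analyzer": {
            "text_with_parts": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "dot_parts"],
            }
        },
    }
}

# Serialize to the JSON body that would be sent when creating an index.
print(json.dumps(settings, indent=2))
```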
Comment 7 Nik Everett 2013-09-30 19:57:24 UTC
I looked into this some more and I'm still not happy with it.  I can fix it by adding a PatternCaptureFilter with the pattern "([^\.]+)" but that has some problems:
1.  Highlighting just gets really really confused.  If one part matches then the whole thing matches.
2.  Adding a regex like that to every single token can't be quick.
3.  I just looked at Bug 54669 which wanted _more_ precision around funky token patterns.

I'm going to resolve this as fixed now because the original problem, not being able to find a file name, is pretty well fixed.  I'd love to hear arguments for splitting file names around the ".", but I'm currently convinced it isn't a good idea.
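
To make the trade-off concrete, here is a rough Python emulation of what a pattern capture filter does with the pattern quoted in this comment: each token is kept and every capture-group match inside it is emitted as an extra token. The `pattern_capture` function is a hypothetical stand-in for illustration, not the Lucene or Elasticsearch implementation.

```python
import re

# Pattern from the comment above; inside a character class the backslash
# before the dot is optional, so this is equivalent to "([^.]+)".
PATTERN = re.compile(r"([^\.]+)")


def pattern_capture(tokens, pattern=PATTERN, preserve_original=True):
    """Emit each token plus every capture-group match found inside it."""
    out = []
    for token in tokens:
        emitted = [token] if preserve_original else []
        # The regex is evaluated against every single token.
        for match in pattern.finditer(token):
            for group in match.groups():
                if group and group not in emitted:
                    emitted.append(group)
        out.extend(emitted)
    return out


print(pattern_capture(["savepag", "greyed.png"]))
# ['savepag', 'greyed.png', 'greyed', 'png']
```

With the extra "greyed" token in the index, a search for "greyed" would match. The loop also shows the cost raised in point 2: the regex runs against every token, including the many that contain no dot at all.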
