Last modified: 2014-06-25 13:59:12 UTC

This static website is read-only and kept for historical purposes. Logging in is not possible, and apart from displaying bug reports and their history, links might be broken. See T68243, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 66243 - Regression: using unicode normalization analyzer misses results in search
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Nik Everett
Depends on:
Blocks:
Reported: 2014-06-06 07:22 UTC by matanya
Modified: 2014-06-25 13:59 UTC (History)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description matanya 2014-06-06 07:22:42 UTC
Hello, since the Unicode normalization analyzer was installed for Hebrew, some expected search results are missing.

How to reproduce:

Compare the results from this search on Wikidata:

https://www.wikidata.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=Special%3ASearch&go=%D7%9C%D7%93%D7%A3

to the same search on Hebrew wiki:

https://he.wikipedia.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=%D7%9E%D7%99%D7%95%D7%97%D7%93%3A%D7%97%D7%99%D7%A4%D7%95%D7%A9&go=%D7%9C%D7%A2%D7%A8%D7%9A

One would expect the five results shown in the Wikidata search to show up on Hebrew wiki as well, but the first and last results on Wikidata don't appear in the Hebrew wiki search results.

Best
Comment 1 Nik Everett 2014-06-10 13:12:56 UTC
Results number 1 and 5 on wikidata look like results number 1 and 2 on hewiki. I wonder if we lost those pages temporarily? That'd be bad.
Comment 2 Nik Everett 2014-06-10 13:13:17 UTC
Or, am I reading it wrong?
Comment 3 Nik Everett 2014-06-10 15:17:40 UTC
This is a better comparison:
https://www.wikidata.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=Special%3ASearch&go=%D7%9C%D7%93%D7%A3
to
https://he.wikipedia.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=%D7%9E%D7%99%D7%95%D7%97%D7%93%3A%D7%97%D7%99%D7%A4%D7%95%D7%A9&go=%D7%9C%D7%A2%D7%A8%D7%9A&fulltext=1
The first result in wikidata (https://www.wikidata.org/wiki/Q7003270) isn't in the hewiki results. On further digging, the page exists (https://he.wikipedia.org/wiki/%D7%A7%D7%9C%D7%99%D7%A4%D7%95%D7%A8%D7%93_%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99), but when I try to fetch it from the search index it isn't in there:
manybubbles@elastic1003:~$ curl localhost:9200/hewiki_content/page/495403
{"_index":"hewiki_content_1401724632","_type":"page","_id":"495403","found":false}
So what is the deal?
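
For anyone repeating this kind of check across many pages, the "is it in the index" test boils down to reading the "found" field of the response. A minimal sketch in Python, assuming the response body has the same shape as the one pasted above; the helper name is_indexed is made up for illustration and is not part of CirrusSearch:

```python
import json


def is_indexed(response_body: str) -> bool:
    """Return True if an Elasticsearch GET response says the document exists.

    Hypothetical helper: it only inspects the "found" field, as seen in the
    response quoted in this comment.
    """
    doc = json.loads(response_body)
    return bool(doc.get("found", False))


# Response body copied from the curl call above.
response = (
    '{"_index":"hewiki_content_1401724632","_type":"page",'
    '"_id":"495403","found":false}'
)

print(is_indexed(response))  # False
```

Treating a missing "found" field as "not indexed" is a conservative choice made for this sketch; a real check would probably want to tell errors apart from genuine misses.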
Comment 4 Nik Everett 2014-06-10 18:01:15 UTC
(In reply to Nik Everett from comment #3)
> So what is the deal?


That is rhetorical - I'm going to figure it out.
Comment 5 Nik Everett 2014-06-10 18:09:16 UTC
I added that page back into the index:
manybubbles@terbium:~$ mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki hewiki --fromId 495402 --toId 495403
Indexed 1 pages ending at 495403 at 6/second
Indexed a total of 1 pages at 6/second
manybubbles@terbium:~$ 


That's just remediation.  Now to figure out why it wasn't in there in the first place.
Comment 6 Gerrit Notification Bot 2014-06-11 16:12:38 UTC
Change 138835 had a related patch set uploaded by Manybubbles:
Add a maintenance script to make the index sane

https://gerrit.wikimedia.org/r/138835
Comment 7 Nik Everett 2014-06-11 16:23:22 UTC
I've written a tool to scan the index and look for insanity.  I'm tempted to chalk some of the insanity in Hebrew up to the Hebrew analyzer, which was buggy and which we had in production for two weeks.  The tool should heal whatever damage it did.  Then we'll run it again a few days later and see if we get _more_ insanity.  Any new insanity would have the benefit of being recent.
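
The general idea of such a scan can be sketched as a set comparison between the page ids the wiki knows about and the ids present in the search index. This is only an illustration in Python under that assumption, not the maintenance script from change 138835; the function and field names are hypothetical, and every id other than 495403 is invented for the example:

```python
from typing import Iterable, List, NamedTuple


class SanityReport(NamedTuple):
    missing_from_index: List[int]  # pages on the wiki with no search document
    orphaned_in_index: List[int]   # search documents with no page on the wiki


def check_index_sanity(
    wiki_page_ids: Iterable[int], indexed_page_ids: Iterable[int]
) -> SanityReport:
    """Compare the two id sets and report the differences in sorted order."""
    wiki = set(wiki_page_ids)
    indexed = set(indexed_page_ids)
    return SanityReport(
        missing_from_index=sorted(wiki - indexed),
        orphaned_in_index=sorted(indexed - wiki),
    )


# 495403 is the page from this bug; the other ids are placeholders.
report = check_index_sanity(
    wiki_page_ids=[495401, 495402, 495403],
    indexed_page_ids=[495401, 495402, 999999],
)

print(report.missing_from_index)  # [495403]
print(report.orphaned_in_index)   # [999999]
```

Pages in the first list would be candidates for re-indexing, the way page 495403 was re-added with forceSearchIndex.php in comment 5, and entries in the second list would be candidates for deletion from the index.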
Comment 8 Gerrit Notification Bot 2014-06-12 15:53:30 UTC
Change 138835 merged by jenkins-bot:
Add a maintenance script to make the index sane

https://gerrit.wikimedia.org/r/138835
Comment 9 Nik Everett 2014-06-25 13:59:12 UTC
Saneitizer seems to have done the trick here.  I'm going to claim it was the broken analyzer.  If we lose more pages I'll revise that claim.
