Last modified: 2014-10-01 01:01:53 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T71766, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 69766 - Stemming: CirrusSearch does not find Биология for a Биологии query on ru.wp (but LuceneSearch does)
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nik Everett
Keywords: i18n
Depends on:
Blocks:
Reported: 2014-08-20 01:01 UTC by Gryllida
Modified: 2014-10-01 01:01 UTC (History)


Comment 1 Nik Everett 2014-08-20 14:20:21 UTC
Paraphrasing the link:
Searched for биологии and биология wasn't in the search results but it should have been.


Other information:
Searching for биология returns both биологии and биология as it should.
Comment 2 Gryllida 2014-08-21 00:22:40 UTC
Can someone please give this a higher priority?
A gadget (one in MediaWiki:*), not a userscript, is breaking as Lucene search is being removed from wikis.
Comment 3 Nemo 2014-08-21 09:56:15 UTC
Removed link to outdated plan.
Comment 4 Andre Klapper 2014-08-21 11:30:27 UTC
Which exact gadget (link please)? How is it "broken"?
Comment 5 Gryllida 2014-08-21 12:54:36 UTC
This one: https://ru.wikipedia.org/wiki/MediaWiki:Gadget-wikilinker.js
Its documentation: https://ru.wikipedia.org/wiki/Википедия:Гаджеты/Викиссыльщик

Description of the gadget issues:

TL;DR: they actually look up "биолог*" as they do client-side stemming in js to format wikilinks correctly. The cirrus search gives weird results, as it misses [[Биология]] on RU.WN for some reason, while it doesn't miss it on RU.WP. 

This should probably go to a separate bug, or it probably should not - I have not yet analysed this behaviour enough to understand whether it has anything to do with the original issue described in this bug.

----

The gadget has 3 versions.

1) The old version (old version in diff [1]) uses the default search engine.

loadXMLDoc(wgServer + wgScriptPath + '/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=' + preparedText);

2) The "new" version (new version in diff [1]) is explicitly set to use LuceneSearch. This was done in edit [1], with a comment that CirrusSearch gives unreliable results. This version works at Russian Wikipedia, but stopped working at Russian Wikinews around the end of July. It gives an 'HTTP timeout' message.

loadXMLDoc(wgServer + wgScriptPath + '/api.php?action=query&list=search&srbackend=LuceneSearch&srlimit=5&srprop=&srredirects=1&format=json&srsearch=' + preparedText);

3) A local Russian Wikinews [2] (and maybe other projects) version is exactly the same as version (1).

var xmlDocUrl = '//ru.wikipedia.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&origin=' + document.location.protocol + '//' + document.location.hostname + '&srsearch=' + preparedText;

[1] https://ru.wikipedia.org/w/index.php?title=MediaWiki:Gadget-wikilinker.js&diff=60626342&oldid=56265657
[2] https://ru.wikinews.org/wiki/MediaWiki:Gadget-wikilinker.js
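
For comparing the variants above outside the gadget, the request URLs can be built with explicit encoding instead of string concatenation. This is only a minimal sketch: buildSearchUrl and its options argument are hypothetical names, not part of the gadget, and the parameter set is copied from the calls quoted above.

```javascript
// Hypothetical helper, not taken from the gadget source.
// Builds a list=search URL with the same parameters as the gadget calls above.
function buildSearchUrl(apiBase, term, options = {}) {
  const params = new URLSearchParams({
    action: 'query',
    list: 'search',
    srlimit: '5',
    srprop: '',
    srredirects: '1',
    format: 'json',
  });
  // Version (2) of the gadget adds srbackend=LuceneSearch; version (1) omits it.
  if (options.backend) {
    params.set('srbackend', options.backend);
  }
  // URLSearchParams percent-encodes non-ASCII text such as Cyrillic.
  params.set('srsearch', term);
  return apiBase + '?' + params.toString();
}

const defaultUrl = buildSearchUrl('https://ru.wikinews.org/w/api.php', 'биолог*');
const luceneUrl = buildSearchUrl('https://ru.wikinews.org/w/api.php', 'биолог*', {
  backend: 'LuceneSearch',
});
console.log(defaultUrl);
console.log(luceneUrl);
```

The two URLs differ only in the srbackend parameter, which is the difference between versions (1) and (2).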

---

I looked at the raw network log in a browser and realised that apparently the scripts all use client-side stemming and look up "биолог*". Sorry, I probably misidentified the issue that broke the gadget. Some more analysis follows below; I would file a new bug, but I have not yet identified the exact issue behind the problem or whether it is different.

---

This means we have these 3 sorts of queries:

1) api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=биолог*
2) api.php?action=query&list=search&srbackend=LuceneSearch&srlimit=5&srprop=&srredirects=1&format=json&srsearch=биолог*
3) api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=биолог*

Queries 1 and 3 are the same, so only 1 and 2 need to be compared.

---

Results:

Russian Wikipedia:
1) Биология
2) Биология

Russian Wikinews:
1) Интервью с исследователем органов чувств Домиником Кларком о шмелях и электрических полях цветков
2) HTTP timeout

Note that "Биология" also exists at Russian Wikinews (although it is a redirect).


Now Russian Wikinews can no longer use Wikilinker to link to local articles.
Comment 6 Nik Everett 2014-08-27 21:01:55 UTC
Don't worry about filing a new bug.  I've got it from here.
Comment 7 Nik Everett 2014-08-27 21:21:24 UTC
This looks to have been caused by us not using Unicode-style regexes when detecting the * syntax.  We have a feature that was supposed to run those prefix queries against the unstemmed copy of the text, but it wasn't kicking in for Cyrillic because PHP hates me.
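
The class of problem described here can be illustrated in JavaScript. This is an analogy only, not the CirrusSearch code: the real fix is in PHP in the Gerrit change linked in the next comment, and both patterns and the isPrefixQuery helper below are assumptions made up for the illustration.

```javascript
// ASCII-only pattern: \w covers [A-Za-z0-9_] only, so Cyrillic letters never match.
const asciiPrefix = /^\w+\*$/;

// Unicode-aware pattern: \p{L} is any letter and \p{N} any number (requires the u flag).
const unicodePrefix = /^[\p{L}\p{N}_]+\*$/u;

// Hypothetical detector for the "word*" prefix syntax.
function isPrefixQuery(term, pattern) {
  return pattern.test(term);
}

console.log(isPrefixQuery('biolog*', asciiPrefix));   // true
console.log(isPrefixQuery('биолог*', asciiPrefix));   // false: prefix handling is skipped
console.log(isPrefixQuery('биолог*', unicodePrefix)); // true
```

With the ASCII-only pattern a Cyrillic prefix query is not recognised as a prefix query at all, which matches the symptom that the unstemmed prefix lookup did not kick in.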
Comment 8 Nik Everett 2014-08-28 00:00:13 UTC
Proposed fix:  https://gerrit.wikimedia.org/r/#/c/156699/
Comment 9 Nik Everett 2014-08-28 13:18:04 UTC
Merged.  It'll go to test wikis and mediawiki.org today, non-wikipedias on Tuesday, and wikipedias on Thursday.
Comment 10 Gryllida 2014-09-08 22:32:56 UTC
Presumably this is in production now. Sorry, I don't see this working as expected.

http://ru.wikinews.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=%D0%B1%D0%B8%D0%BE%D0%BB%D0%BE%D0%B3*&format=xml does not return 'Биология' or 'Category:Биология', but it should (especially the former). Instead, it returns "Интервью с исследователем органов чувств Домиником Кларком о шмелях и электрических полях цветков" and other long article names.
Comment 11 Nik Everett 2014-09-09 14:51:20 UTC
Now you've hit something else!  Cirrus will only find results from cross-namespace redirects if the target of the redirect is included in the search.  This finds the category:
http://ru.wikinews.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=%D0%B1%D0%B8%D0%BE%D0%BB%D0%BE%D0%B3*&format=xml&srnamespace=0,14

While it's possible for me to fix this, it's a pretty difficult change and the limitation hasn't caused _too_ many problems.  If possible, can you get the tool to work around this?
Comment 12 Gryllida 2014-09-09 22:52:15 UTC
> Cirrus will only find results from cross namespace redirects if the target of the redirect is included in the search.
Doesn't make sense to me:
1) For some reason, in your URL I don't see "Биология" in the results, although the page it redirects to ("Category:Биология" = "Категория:Биология") is included. This behavior appears to be inconsistent with your comment.
2) A search tool should be able to find namespace pages even if they are redirects. Their title /does/ match, after all. It probably makes no sense for end users who lack interest in categories. But then please consider making "do not follow redirects" an option.

> If possible, can you get the tool to work around this?
This wikilinker gadget needs to produce [[биология|биологии]],  [[биология|биологией]], [[биология|биология]] links reliably. [[Category:биология|биологии]] sort of links are against project policies. So I guess no, I can't work around this, unless I missed some pretty things, or unless I'm willing to do such an ugly thing as check for "Category:$1" pattern in the result and manually check whether $1 main namespace page exists. One would think that this has to be done server-side. (Lucene Search worked fine with it, btw.)
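
The client-side workaround described above (spot a "Category:$1" result, then check whether a main namespace page $1 exists) might look roughly like this. It is a sketch under assumptions: resolveLinkTarget is a hypothetical name, the existence check is reduced to a lookup in a pre-fetched Set of main namespace titles, and the prefix list is an assumption covering the English and Russian category prefixes.

```javascript
// Hypothetical helper, not part of the gadget.
// resultTitle: a title returned by list=search.
// mainNamespaceTitles: Set of titles known to exist in the main namespace
// (in a real gadget this would need an extra API lookup).
function resolveLinkTarget(resultTitle, mainNamespaceTitles,
                           categoryPrefixes = ['Category:', 'Категория:']) {
  for (const prefix of categoryPrefixes) {
    if (resultTitle.startsWith(prefix)) {
      const bare = resultTitle.slice(prefix.length);
      // Link to the main namespace redirect if it exists; otherwise give up,
      // since linking to the category itself is against project policies.
      return mainNamespaceTitles.has(bare) ? bare : null;
    }
  }
  // Non-category results are used unchanged.
  return resultTitle;
}

const known = new Set(['Биология']);
console.log(resolveLinkTarget('Категория:Биология', known)); // Биология
console.log(resolveLinkTarget('Категория:Физика', known));   // null
```

This costs an additional existence check per category result, which is the ugliness the comment objects to.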
Comment 13 Nik Everett 2014-09-22 13:37:02 UTC
Sorry for not updating this earlier.  The answer is no.  Redirects won't appear in Cirrus's results.  You're welcome to continue using lsearchd until it's turned off in a few months by adding &srbackend=LuceneSearch to the URL parameters, but Cirrus isn't going to show you the redirect page as a result.  It'll always come back as [[Category:биология|биологии]] with Cirrus, and at some point we'll shut down lsearchd and you won't be able to select it any more.

As to searches in the main namespace finding main namespace redirects to the category namespace: it probably should, but it's not going to happen soon.

Sorry this isn't what you wanted to hear.
Comment 14 Gryllida 2014-09-22 21:28:09 UTC
Can I open a new bug about a redirects=yes param for CirrusSearch?  Not having it as a default may be reasonable, of course, but as I said above, the workaround is ugly...

> This wikilinker gadget needs to produce [[биология|биологии]],  
> [[биология|биологией]], [[биология|биология]] links reliably. 
> [[Category:биология|биологии]] sort of links are against project policies. So 
> I guess no, I can't work around this, unless I missed some pretty things, or 
> unless I'm willing to do such an ugly thing as check for "Category:$1" 
> pattern in the result and manually check whether $1 main namespace page 
> exists. One would think that this has to be done server-side.
Comment 15 Nik Everett 2014-09-22 21:44:28 UTC
Certainly!  It's better to have a bug than not, even if all it does is reference the conversation in this bug.
Comment 16 Gryllida 2014-10-01 01:01:53 UTC
Ok, filed bug 71491. Thanks! :)


