Last modified: 2014-10-24 07:37:38 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T73491, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 71491 - Redirects should appear in Cirrus's results
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Assigned To: Nobody - You can work on this!
 
Reported: 2014-10-01 01:01 UTC by Gryllida
Modified: 2014-10-24 07:37 UTC

Description Gryllida 2014-10-01 01:01:36 UTC
[ This bug should be treated with higher priority: the deprecation of Lucene search has made a Wikilinker gadget unusable in production on some Wikimedia projects. ]

== Background ==
1) A wikilinker gadget needs to produce links such as [[биология|биологии]], [[биология|биологией]] and [[биология|биология]] reliably, where the link text is an inflected form of the page title (биология, "biology"). Links of the form [[Category:биология|биологии]] are against project policies.
2) On some projects, such as Russian Wikinews, [[Биология]] is a redirect to [[Category:Биология]], because "Биология" is not a valid news headline and never will be.
3) Lucene search worked fine because it also returned redirects, but it is deprecated and no longer running in production.

== Problem description ==
1) http://ru.wikinews.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=%D0%B1%D0%B8%D0%BE%D0%BB%D0%BE%D0%B3*&format=xml does not return 'Биология' or 'Category:Биология', but it should (especially the former). Instead, it returns "Интервью с исследователем органов чувств Домиником Кларком о шмелях и электрических полях цветков" ("Interview with sensory organ researcher Dominic Clarke about bumblebees and the electric fields of flowers") and other long article names.
2) A client-side workaround is not practical. From the discussion at bug 69766:
>> If possible, can you get the tool to work around this?
> This wikilinker gadget needs to produce [[биология|биологии]],
> [[биология|биологией]], [[биология|биология]] links reliably.
> [[Category:биология|биологии]] sort of links are against project policies. So
> I guess no, I can't work around this, unless I missed some pretty things, or
> unless I'm willing to do such an ugly thing as check for "Category:$1"
> pattern in the result and manually check whether $1 main namespace page
> exists. One would think that this has to be done server-side.
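
For illustration only, the "ugly" client-side workaround mentioned in the quote might look roughly like the sketch below. The helper names `stripCategoryPrefix` and `candidateRedirects` are hypothetical, the namespace prefixes are an assumption, and the follow-up check that each candidate page really exists (a second API request) is not shown.

```javascript
// Hypothetical helper: if a title is in the category namespace, return the bare name.
// The prefix list (English and Russian namespace names) is an assumption.
function stripCategoryPrefix( title ) {
    var match = /^(?:Category|Категория):(.+)$/.exec( title );
    return match ? match[1] : null;
}

// Hypothetical helper: from search result titles, collect main-namespace names
// that might exist as redirects to a category. Each candidate would still need
// a separate existence check before it could be used in a wikilink.
function candidateRedirects( titles ) {
    var out = [];
    titles.forEach( function ( title ) {
        var bare = stripCategoryPrefix( title );
        if ( bare !== null && out.indexOf( bare ) === -1 ) {
            out.push( bare );
        }
    } );
    return out;
}
```

The extra round trip per lookup is what makes this approach unattractive compared with a server-side option.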

**Russian Wikinews can no longer use Wikilinker to link to local redirect pages.**

== Proposed change ==
Please add an option to show redirects in Cirrus Search results, even if this option is off by default.
Comment 1 Chad H. 2014-10-01 01:57:32 UTC
I can't help but think we've got something backwards here. Wasn't the point of Gerrit change #118592 to always include them?
Comment 2 Gryllida 2014-10-01 10:19:27 UTC
Please see problem description, (1).

How can I get "Биология" (and NOT 'Категория:Биология') to appear in these results?
Comment 3 Nik Everett 2014-10-01 13:29:07 UTC
What Cirrus does now is always search for pages by their redirects, but it's always the target of the redirect that is returned and never the redirect itself.  Cirrus thinks of redirects as attributes of the redirect target and ignores the redirect pages themselves.  Look at the redirects object in the JSON blob here:  https://en.wikipedia.org/wiki/Barack_Obama?action=cirrusdump

The upshot is that when you search you can find the result via the redirect, and it'll come back in the redirect field, but it'll never come back as a title.  Example:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=search&format=json&srsearch=O%27bama&srprop=snippet|titlesnippet|redirectsnippet|sectionsnippet&srlimit=10&srbackend=CirrusSearch
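
A rough sketch of how a client could read the matched redirect out of such a result follows. It assumes each hit is an object with a `title` and, when a redirect matched, a `redirectsnippet` string containing highlighting markup; that shape and the helper name `matchedRedirect` are assumptions for illustration, not a description of the actual response.

```javascript
// Hypothetical helper: return the text of the matched redirect for one search hit,
// or null when the hit carries no redirect snippet.
function matchedRedirect( hit ) {
    if ( typeof hit.redirectsnippet !== 'string' || hit.redirectsnippet === '' ) {
        return null;
    }
    // Strip highlighting tags such as <span class="searchmatch">...</span>.
    // HTML entities, if any, are left untouched.
    return hit.redirectsnippet.replace( /<[^>]+>/g, '' );
}
```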


I'm honestly not sure what lsearchd does for this.  It's similar to Cirrus so far as I can tell:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=search&format=json&srsearch=O%27bama&srprop=snippet|titlesnippet|redirectsnippet|sectionsnippet&srlimit=10&srbackend=LuceneSearch

but it doesn't seem to always work the same way as Cirrus.  If it produced the right result for Wikilinker then it must be different somehow.  It's a lot of code to read, and I've read a lot of it, but I don't recall reading this part.


In any case, for now I think the simplest solution for Wikilinker is to set srbackend=LuceneSearch to keep the old behavior.  That'll certainly buy us a few months of continued working, and it's reasonably simple.
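
As a sketch of that suggestion, the gadget's request URL could be built with the backend as an optional parameter. The function name `buildSearchUrl` is hypothetical, the parameter values are copied from the query in the bug description, and the code assumes an environment where `URLSearchParams` is available.

```javascript
// Hypothetical helper: build a list=search request, optionally forcing a backend.
function buildSearchUrl( apiBase, query, backend ) {
    var params = new URLSearchParams( {
        action: 'query',
        list: 'search',
        srlimit: '5',
        srprop: '',
        srredirects: '1',
        format: 'json',
        srsearch: query
    } );
    // srbackend is only added when a backend is requested explicitly.
    if ( backend ) {
        params.set( 'srbackend', backend );
    }
    return apiBase + '?' + params.toString();
}

var luceneUrl = buildSearchUrl( 'http://ru.wikinews.org/w/api.php', 'биолог*', 'LuceneSearch' );
```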
Comment 4 Gryllida 2014-10-01 22:16:59 UTC
Yeah, I think it's complicated and I'm sure we'd figure it out. :-)

---
srbackend=LuceneSearch gives HTTP timeout on Russian Wikinews:

http://ru.wikinews.org/w/api.php?action=query&list=search&srbackend=LuceneSearch&srlimit=5&srprop=&srredirects=1&format=xml&srsearch=биолог*
Истекло время ожидания HTTP-запроса. (English: "HTTP request timed out.")

Should I gather community consensus on re-enabling it, or can you just do it? I can file a new bug if necessary.
---
Comment 5 Gryllida 2014-10-23 21:50:16 UTC
I repeat: how can I re-enable LuceneSearch on this project?
Comment 6 Chad H. 2014-10-23 22:00:37 UTC
Making Lucene the default wouldn't give you anything that srbackend can't already do.
Comment 7 Nik Everett 2014-10-23 22:21:08 UTC
Two answers to two different questions:
1.  If LuceneSearch is timing out or failing in another way, we'll need to fix it, at least for the next few months.  The link seems to be working now, but I have no doubt that it was failing before.  It's pretty difficult to debug it without breaking it worse.  That's why we're moving away from it.  If it fails again, please update the bug or ping us on IRC or something.

2.  This kind of issue isn't going to make us switch LuceneSearch back to the primary for ruwiki or ruwikinews.  We're totally willing to switch back if Cirrus hurts more than it helps, but in this case you have a clear workaround and you were relying on a behavior that no one else noticed.  And the behavior just doesn't seem helpful outside of your use case.


Truthfully, I've changed Cirrus's behavior to match Lucene's quirks in the past even though the quirk had a pretty niche audience, and I'd do it again in this case too, but what you're asking for would require a huge architectural change to Cirrus which just isn't worth it.

If/when we do have to make some huge change I'll keep this in mind so it's an option, but I honestly don't think I can do more than that at this point.
Comment 8 Nik Everett 2014-10-23 22:26:53 UTC
A (maybe bad) idea!  What if we piped the list of redirects that match the query back through the API?  You'd still get the non-redirect page back, but it'd come with a redirect list.
Comment 9 Gryllida 2014-10-24 07:37:38 UTC
> The link seems to be working now but I have no doubt that it was failing before.  It's pretty difficult to debug it without breaking it worse.

Oops. Works now.

> 2.  This kind of issue isn't going to make us switch LuceneSearch back to the primary for ruwiki or ruwikinews.

I am not asking for it to be the primary. I would like to have it as an /option/. At the time it was timing out, there was no such option.

> A (maybe bad) idea!  What if we piped the list of redirects 
> that match the query back through the api.  You'd still 
> get the non-redirect page back but it'd come with a redirect list.

The way the wikilinker gadget works now, with LuceneSearch, is that it takes the first few results (three according to the code comment below, although the loop inspects up to five) and chooses the shortest title. Such a rewrite could threaten the relevance of the results.

Snippet from the gadget:

--
// If the query was a single word, pick the shortest title among the top results,
// so that "Англией" yields "Англия" (England) rather than "Англиканство" (Anglicanism).
// The original comment says "first three results"; the loop inspects up to five.
if ( requestTokens === 1 ) {
    var resar = [];
    var prefix = txt.substr( 0, 3 ).toLowerCase();

    for ( var j = 0; j <= 4; j++ ) {
        var hit = resp.query.search[j];
        // Keep only results whose title shares its first three characters with the query.
        if ( typeof hit !== 'undefined' && hit.title.substr( 0, 3 ).toLowerCase() === prefix ) {
            resar.push( hit.title );
        }
    }

    // compareStringLengths is defined elsewhere in the gadget (not shown here).
    resar.sort( compareStringLengths );

    if ( typeof resar[0] !== 'undefined' ) {
        pageName = resar[0];
    }
}
--
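
The snippet relies on `compareStringLengths`, which is not shown. A minimal sketch, assuming it simply orders strings by ascending length so that the shortest title ends up first (the gadget's real comparator may differ):

```javascript
// Assumed behavior: shorter strings sort before longer ones.
function compareStringLengths( a, b ) {
    return a.length - b.length;
}

// Example with the titles from the code comment above.
var sortedTitles = [ 'Англиканство', 'Англия' ].sort( compareStringLengths );
```

With this comparator, `sortedTitles[0]` is the shorter title, 'Англия'.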

This could, in theory, be rewritten to pull the list of results plus their redirects and pick the shortest title. But this would not guarantee the best result, as some of the redirect origins could be rather short but irrelevant. See:

lucene search:
1. [[biology]] (redirect to [[category:biology]])
2. [[category:biology]]
3. ...

cirrus search:
1. [[category:biology]]
  <- [[biology]]
  <- [[bio]]
  <- ...
2. [[biologists discover world's smallest orchid]]
  <- ...

With Cirrus search, the script for "biology" would return "[[bio|biology]]", but that does not mean such a wikilink would be accurate.
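
The concern can be reproduced with a small sketch. It assumes a hypothetical response shape in which each hit carries a `redirects` array (no such field is confirmed in this thread), and it reuses the gadget's three-character prefix filter and shortest-title rule; the function name `shortestCandidate` is made up for the example.

```javascript
// Hypothetical: choose the shortest name among titles and their redirects
// that share the first three characters with the query text.
function shortestCandidate( hits, txt ) {
    var prefix = txt.slice( 0, 3 ).toLowerCase();
    var candidates = [];
    hits.forEach( function ( hit ) {
        [ hit.title ].concat( hit.redirects || [] ).forEach( function ( name ) {
            if ( name.slice( 0, 3 ).toLowerCase() === prefix ) {
                candidates.push( name );
            }
        } );
    } );
    candidates.sort( function ( a, b ) {
        return a.length - b.length;
    } );
    return candidates.length > 0 ? candidates[0] : null;
}

// Sample data mirroring the Cirrus listing above.
var sampleHits = [
    { title: 'Category:Biology', redirects: [ 'Biology', 'Bio' ] },
    { title: "Biologists discover world's smallest orchid", redirects: [] }
];
var picked = shortestCandidate( sampleHits, 'biology' );
```

Here `picked` is 'Bio' rather than 'Biology': the shortest redirect wins even though it is not necessarily the most accurate link target.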
