Last modified: 2014-11-20 16:57:54 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T75623, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 73623 - Inconsistent search results when combining list=search and generator=search
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: High normal with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 73616
Reported: 2014-11-19 21:40 UTC by Dmitry Brant
Modified: 2014-11-20 16:57 UTC (History)
CC List: 10 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Results set for "fish" before update (409.03 KB, image/png)
2014-11-19 23:42 UTC, Monte Hurd
Result set for "fish" after update (280.77 KB, image/png)
2014-11-19 23:43 UTC, Monte Hurd

Description Dmitry Brant 2014-11-19 21:40:00 UTC
Background:
In our mobile apps, we're implementing full-text search as the user types a search term. We query the API as follows:

- We use prop=pageprops with generator=search to get the list of results, which also gives us the image thumbnail and Wikidata ID for each result, but this returns the results in the wrong order, so:
- We combine the query with list=search, which should give the same set of results (albeit without thumbnails or Wikidata ID), but in the correct order.
- We then correlate the data from the two lists (expecting both lists to contain the exact same items) to arrive at the full results in the correct order.

For example, our query is something like this:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=search&prop=pageprops&format=json&srsearch=Barack&srnamespace=0&srwhat=text&srinfo=suggestion&srlimit=12&ppprop=wikibase_item&generator=search&gsrsearch=Barack&gsrnamespace=0&gsrwhat=text&gsrprop=redirecttitle&gsrlimit=12
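
The correlation step described above could be sketched roughly as follows. This is only an illustration of the approach, not the apps' actual code: the function name, the sample response, and the choice to match on the "title" field are assumptions, and the titles and page IDs are placeholders.

```python
def merge_in_rank_order(response):
    """Return (merged, missing): page records ordered like query.search,
    plus the titles that query.search has but query.pages lacks."""
    query = response.get("query", {})
    # query.pages is a JSON object keyed by page ID; its key order carries no meaning.
    pages_by_title = {page["title"]: page for page in query.get("pages", {}).values()}
    merged, missing = [], []
    for hit in query.get("search", []):  # JSON array, so the ranking order survives parsing
        page = pages_by_title.get(hit["title"])
        if page is None:
            missing.append(hit["title"])
        else:
            merged.append(page)
    return merged, missing


# Placeholder data shaped like a combined list=search + generator=search response.
sample = {
    "query": {
        "search": [{"title": "Page B"}, {"title": "Page A"}, {"title": "Page C"}],
        "pages": {
            "1": {"pageid": 1, "title": "Page A"},
            "2": {"pageid": 2, "title": "Page B"},
        },
    }
}

merged, missing = merge_in_rank_order(sample)
print([page["title"] for page in merged])  # ['Page B', 'Page A']
print(missing)                             # ['Page C']
```

A non-empty `missing` list is the symptom this bug is about: a title ranked by list=search that the generator did not return.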

However:
The two sets of results often don't match up with each other -- there are sometimes one or more pages in one list that are not present in the other.  What's even more strange, this doesn't happen every time:  the lists might match up in one instance, and then be different in another (for the exact same query).

For example, try executing the query linked above, and observe the last item in the returned "pages" array. 4 out of 5 times, the item will be "There's No One as Irish as Barack O'Bama", but sometimes it becomes "Statewide opinion polling for the United States presidential election, 2012", which is no longer consistent with the other list returned by list=search.
Comment 1 Brad Jorsch 2014-11-19 23:01:12 UTC
Just looking at the "last" item in the pages object isn't going to tell you much, because JSON doesn't guarantee ordering in hashes.

After repeated trials, I cannot reproduce a single trial where 'pages' contains a title not in 'search' or vice versa. If this does somehow occur, it's most likely due to the underlying search engine rather than anything in the API.
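
One way to check for a mismatch without relying on the key order of the 'pages' object is to compare the two results as sets of titles. A minimal sketch, assuming each entry in both results carries a "title" field; the function name and sample data are made up for illustration.

```python
def compare_result_sets(response):
    """Return (only_in_search, only_in_pages) as sorted lists of titles."""
    query = response.get("query", {})
    search_titles = {hit["title"] for hit in query.get("search", [])}
    page_titles = {page["title"] for page in query.get("pages", {}).values()}
    # Set differences ignore ordering entirely, so parser behaviour cannot skew the check.
    return sorted(search_titles - page_titles), sorted(page_titles - search_titles)


# Placeholder response in which each result holds one title the other lacks.
example = {
    "query": {
        "search": [{"title": "Page A"}, {"title": "Page B"}],
        "pages": {
            "1": {"pageid": 1, "title": "Page A"},
            "3": {"pageid": 3, "title": "Page C"},
        },
    }
}

only_in_search, only_in_pages = compare_result_sets(example)
print(only_in_search)  # ['Page B']
print(only_in_pages)   # ['Page C']
```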
Comment 2 Monte Hurd 2014-11-19 23:34:27 UTC
Just to clarify, the "search" results list is not a hash - it's an array. It was needed because the "pages" results are a hash, so the order was meaningless.
Comment 3 Nik Everett 2014-11-19 23:40:48 UTC
(In reply to Monte Hurd from comment #2)
> Just to clarify, the "search" results list is not a hash - it's an array. It
> was needed because the "pages" results are a hash, so the order was
> meaningless.

Why not change the search results list so that it can include the properties that you need?  That seems about 1000000% more efficient than making two HTTP calls.

Search results should be mostly consistent across requests but there is nothing that keeps them entirely consistent.  We're changing the underlying index constantly and we make no effort to cache because the tail is so long.  Your search may not even be sent to the same set of machines when you repeat it.  This could cause slight shifting around the edges.

It'd be a bug if something was in the middle one time and not there the next.

Seriously, make just one query.  Save some load on the search cluster and save the user some time.
Comment 4 Monte Hurd 2014-11-19 23:42:54 UTC
Created attachment 17176 [details]
Results set for "fish" before update
Comment 5 Monte Hurd 2014-11-19 23:43:20 UTC
Created attachment 17177 [details]
Result set for "fish" after update
Comment 6 Nik Everett 2014-11-19 23:51:29 UTC
Those results are super duper different.
Comment 7 Nik Everett 2014-11-20 00:42:17 UTC
Could you turn those screenshots into urls?

Also, Cirrus intentionally returns redirects in the list, so that funny search result for fishspinner is actually a result for tropical cyclone.  One thing we could do is penalize results that we find via a redirect as compared to a page title match.  We don't do that right now.  It'd probably make the results make more sense.

Still, I think the underlying issue is that the list=search API doesn't return what you want it to.
Comment 9 Monte Hurd 2014-11-20 01:37:43 UTC
Note that all of my usual daily search testing terms - "art", "cat", "bird", "toast" - have undergone a similar degradation as "fish".

The first result is still always a match, but, for example, the second result when searching for "bird" is now "Bird-Magic rivalry", which may be a great article, but it's probably not the second-most relevant bird article on enwiki :)
Comment 10 Nik Everett 2014-11-20 01:48:00 UTC
I'll work on the relevance issue.  Is this bug about that or the difference between list=search and generator=search?  I don't know much about the API issue but I can totally work on relevance.
Comment 11 Monte Hurd 2014-11-20 02:05:38 UTC
Good point - we may be conflating two issues. Thanks for tackling the relevance part!

I think this bug could then just track the "difference between list=search and generator=search" part if that sounds ok?
Comment 12 Nik Everett 2014-11-20 02:53:06 UTC
OK.  I'm not working on that now because I don't think it's as bad.  Also I don't understand our API too well so I'd have to do some research.  I don't even know enough to know if the bug is in CirrusSearch.

I'm going to reset the assignee to default for now.

Finally: you really should be able to get the results in one query.  That you can't is also a bug.

I'm tracking the relevance issue as bug 73636.
Comment 13 Matthew Flaschen 2014-11-20 04:24:36 UTC
(In reply to Nik Everett from comment #12)
> OK.  I'm not working on that now because I don't think it's as bad.  Also I
> don't understand our API too well so I'd have to do some research.  I don't
> even know enough to know if the bug is in CirrusSearch.

The general idea is that lists return results directly, and generators use the output of a list to feed into another module (like a UNIX pipe).  So generators essentially are the intended way to combine modules in one query like this.  The problem is for some reason list= is also used/necessary in this case.  See https://www.mediawiki.org/wiki/API:Query#Generators
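
As a rough illustration of the pipe idea, a generator-only request feeds the search results straight into prop=pageprops without a separate list= module. The sketch below only builds the URL; the parameter names are taken from the query in comment #0, while the endpoint constant and function name are assumptions made for the example.

```python
from urllib.parse import urlencode

# Assumed endpoint for the example; any MediaWiki api.php URL would be used the same way.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_generator_query(term, limit=12):
    """Build a query URL where generator=search feeds its pages into prop=pageprops."""
    params = {
        "action": "query",
        "format": "json",
        # Generator parameters use the module prefix with a leading "g" (sr -> gsr).
        "generator": "search",
        "gsrsearch": term,
        "gsrnamespace": 0,
        "gsrlimit": limit,
        # Property module applied to every page the generator yields.
        "prop": "pageprops",
        "ppprop": "wikibase_item",
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_generator_query("Barack")
print(url)
```

As discussed in this bug, a request like this returns the pages as an unordered object, which is why the apps also added list=search.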

(In reply to Monte Hurd from comment #8)
> Here is the exact query the iOS app used for those two screenshots:
> 
> https://en.m.wikipedia.org/w/api.
> php?action=query&format=json&generator=prefixsearch&gpslimit=24&gpsnamespace=
> 0&gpssearch=fish&list=prefixsearch&pilimit=24&piprop=thumbnail&pithumbsize=14
> 4&ppprop=wikibase_item&prop=pageprops%7Cpageimages&pslimit=24&pssearch=fish

I notice this does not use indexpageids.  So you're only getting it in the correct order if whatever iOS JSON parser you're using preserves order in hashes/associative arrays (possible, but not certain).  

See https://en.m.wikipedia.org/w/api.php?action=query&format=jsonfm&generator=prefixsearch&gpslimit=24&gpsnamespace=0&gpssearch=fish&list=prefixsearch&pilimit=24&piprop=thumbnail&pithumbsize=144&ppprop=wikibase_item&prop=pageprops|pageimages&pslimit=24&pssearch=fish&indexpageids for a version with indexpageids with a clear array of results.

Also, it's using generator=prefixsearch not generator=search (which comment #0 mentions).
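
For reference, a client could consume the indexpageids output along these lines. This is a minimal sketch with a made-up function name and placeholder data; it assumes the response carries a query.pageids array of string IDs alongside query.pages, and it makes no claim that the array order matches search ranking.

```python
def pages_as_list(response):
    """Turn the query.pages object into a list, following the query.pageids array."""
    query = response.get("query", {})
    pages = query.get("pages", {})
    # pageids holds the same string keys that index the pages object.
    return [pages[page_id] for page_id in query.get("pageids", []) if page_id in pages]


# Placeholder response shaped like one requested with indexpageids.
indexed = {
    "query": {
        "pageids": ["7", "3"],
        "pages": {
            "3": {"pageid": 3, "title": "Page A"},
            "7": {"pageid": 7, "title": "Page B"},
        },
    }
}

print([page["title"] for page in pages_as_list(indexed)])  # ['Page B', 'Page A']
```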
Comment 14 Monte Hurd 2014-11-20 04:57:55 UTC
Matt - it seems the indexpageids are not in the right order. 

And the native JSON parser on iOS doesn't preserve order within resulting associative arrays - just arrays. I believe this is true with Android as well.

We could write custom parsing logic to determine order if these hash entries were made to be in the correct order, but that's a hack every non-PHP consumer of this data would have to implement.

It would be perfect if the entries could be both ordered correctly and returned in order-preserving JSON array format. Could there be a flag that just causes results to come back in array format?
Comment 15 Monte Hurd 2014-11-20 05:02:00 UTC
*to clarify, the array format flag would return array of page hashes, rather than hash of page hashes*
Comment 16 Brad Jorsch 2014-11-20 14:59:22 UTC
(In reply to Matthew Flaschen from comment #13)
> (In reply to Nik Everett from comment #12)
> > OK.  I'm not working on that now because I don't think it's as bad.  Also I
> > don't understand our API too well so I'd have to do some research.  I don't
> > even know enough to know if the bug is in CirrusSearch.
> 
> The general idea is that lists return results directly, and generators use
> the output of a list to feed into another module (like a UNIX pipe).  So
> generators essentially are the intended way to combine modules in one query
> like this.  The problem is for some reason list= is also used/necessary in
> this case.  See https://www.mediawiki.org/wiki/API:Query#Generators

This. The general problem people seem to be running into with search as a generator is that they want to keep some indication of the ranking despite generators currently being defined as producing an unordered list of pageids (or revids).

I'm not opposed to allowing generators to provide additional properties for the generated pages (reopening bug 14859, although I'd probably do it in a different manner than requested there) but it'll take some thought as to how to best do that. I'll put something on [[mw:API/Architecture work/Planning]] so I don't forget to look into it.


As for how the API code works in this situation, it instantiates two instances of ApiQueryPrefixSearch, and each instance runs the same code to get a list of titles:

 $searcher = new TitlePrefixSearch;
 $titles = $searcher->searchWithVariants( $search, $limit, $namespaces );

The processing of that list of titles is different, but it should theoretically be the same titles in the list. What seems to be going on is that this code is not entirely deterministic.

(In reply to Monte Hurd from comment #8)
> Here is the exact query the iOS app used for those two screenshots:
> 
> https://en.m.wikipedia.org/w/api.
> php?action=query&format=json&generator=prefixsearch&gpslimit=24&gpsnamespace=
> 0&gpssearch=fish&list=prefixsearch&pilimit=24&piprop=thumbnail&pithumbsize=14
> 4&ppprop=wikibase_item&prop=pageprops%7Cpageimages&pslimit=24&pssearch=fish

I note you're using gpsnamespace=0, but not psnamespace=0. In this case that doesn't matter, though, since 0 is the default.

I have been able to reproduce the issue with this query, BTW.

(In reply to Matthew Flaschen from comment #13)
> I notice this does not use indexpageids.  So you're only getting it in the
> correct order if whatever iOS JSON parser you're using preserves order in
> hashes/associative arrays (possible, but not certain).

They're presumably using query.prefixsearch (which *is* a JSON array) for the ordering, using the pageids from there to look up the additional info in the query.pages hash.
Comment 17 Nik Everett 2014-11-20 15:09:23 UTC
(In reply to Brad Jorsch from comment #16)
> As for how the API code works in this situation, it instantiates two
> instances of ApiQueryPrefixSearch, and each instance runs the same code to
> get a list of titles:
> 
>  $searcher = new TitlePrefixSearch;
>  $titles = $searcher->searchWithVariants( $search, $limit, $namespaces );
> 
> The processing of that list of titles is different, but it should
> theoretically be the same titles in the list. What seems to be going on is
> that this code is not entirely deterministic.
> 

That is why I originally WONTFIXED it.  Searches aren't deterministic because the data changes under them.  lsearchd was _more_ deterministic than Cirrus because it changed more slowly and had fewer shards and replicas.  The list shouldn't drastically change but it will change some.  How drastic are the changes?

Also, are there things from the search results that you need that you can't get by using it as a generator?  As in, if we allowed generators to return things in order somehow would that be enough for you to only make a single call?
Comment 18 Dmitry Brant 2014-11-20 15:21:06 UTC
(In reply to Nik Everett from comment #17)
> Also, are there things from the search results that you need that you can't
> get by using it as a generator?  As in, if we allowed generators to return
> things in order somehow would that be enough for you to only make a single
> call?

That's correct -- if the generator returned the results in order, then we wouldn't need to do the other call.
Comment 19 Brad Jorsch 2014-11-20 15:33:52 UTC
(In reply to Nik Everett from comment #17)
> The list shouldn't drastically change but it will change some.  How drastic are
> the changes?

The changes don't seem to be very drastic; in some quick checking, it seems to mainly be in the ordering of the results. For example, in one query using only list=prefixsearch "Fisherman's Atlantic City Wind Farm" was at #14 while it dropped to #16 when the query was repeated, with no change in the relative order of any of the other 24 results.
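
A quick way to quantify this kind of drift between two runs is to compare positions title by title. The sketch below is only an illustration with a hypothetical function name and placeholder titles; note that a single item moving down makes the items it passes shift up by one, so those neighbours are reported too.

```python
def rank_shifts(first_run, second_run):
    """Map each title found in both runs to (old_position, new_position) if it moved.

    Positions are 1-based, matching the "#14" / "#16" style used in this comment.
    """
    second_positions = {title: pos for pos, title in enumerate(second_run, start=1)}
    shifts = {}
    for old_pos, title in enumerate(first_run, start=1):
        new_pos = second_positions.get(title)
        if new_pos is not None and new_pos != old_pos:
            shifts[title] = (old_pos, new_pos)
    return shifts


# Placeholder runs: "B" drops from #2 to #4, so "C" and "D" each move up one place.
shifts = rank_shifts(["A", "B", "C", "D"], ["A", "C", "D", "B"])
print(shifts)  # {'B': (2, 4), 'C': (3, 2), 'D': (4, 3)}
```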

> Also, are there things from the search results that you need that you can't
> get by using it as a generator?  As in, if we allowed generators to return
> things in order somehow would that be enough for you to only make a single
> call?

Allowing generators to return things in order isn't likely to happen. More likely is that generator=prefixsearch would be able to add an 'index' field into the hashes inside the 'pages' hash, so the client could sort based on that field's value.

More thoughts on this are at https://www.mediawiki.org/wiki/API/Architecture_work/Planning#Allow_generators_to_provide_data
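
If such a field were added, the client-side sort could be as small as the sketch below. The 'index' field name follows the proposal in this comment and is an assumption about a possible future response shape, not something the API is stated to return here; the function name and sample data are likewise made up.

```python
def sort_pages_by_index(response):
    """Order the page records from query.pages by their (proposed) 'index' field."""
    pages = response.get("query", {}).get("pages", {})
    # The pages object itself stays unordered; only the per-page 'index' value is trusted.
    return sorted(pages.values(), key=lambda page: page["index"])


# Placeholder response carrying the hypothetical 'index' field on each page.
proposed = {
    "query": {
        "pages": {
            "3": {"pageid": 3, "title": "Page A", "index": 2},
            "7": {"pageid": 7, "title": "Page B", "index": 1},
            "5": {"pageid": 5, "title": "Page C", "index": 3},
        }
    }
}

print([page["title"] for page in sort_pages_by_index(proposed)])  # ['Page B', 'Page A', 'Page C']
```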
Comment 20 Nik Everett 2014-11-20 15:39:41 UTC
(In reply to Brad Jorsch from comment #19)
> (In reply to Nik Everett from comment #17)
> > The list shouldn't drastically change but it will change some.  How drastic are
> > the changes?
> 
> The changes don't seem to be very drastic; in some quick checking, it seems
> to mainly be in the ordering of the results. For example, in one query using
> only list=prefixsearch "Fisherman's Atlantic City Wind Farm" was at #14
> while it dropped to #16 when the query was repeated, with no change in the
> relative order of any of the other 24 results.

Cool.  That's pretty much what I expect.

> 
> > Also, are there things from the search results that you need that you can't
> > get by using it as a generator?  As in, if we allowed generators to return
> > things in order somehow would that be enough for you to only make a single
> > call?
> 
> Allowing generators to return things in order isn't likely to happen. More
> likely is that generator=prefixsearch would be able to add an 'index' field
> into the hashes inside the 'pages' hash, so the client could sort based on
> that field's value.
> 
> More thoughts on this are at
> https://www.mediawiki.org/wiki/API/Architecture_work/
> Planning#Allow_generators_to_provide_data

Cool.
Comment 21 Nik Everett 2014-11-20 16:57:54 UTC
Chad just SWATed our fix to prefix search relevance.  You still get redirects if they are very good and there aren't any good regular pages.  Try searching for "ozymandius".  It's a common misspelling of Ozymandias.
