Last modified: 2013-09-26 19:51:54 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T55474, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 53474 - Allow more_like_this searches (return articles similar to query text)
Status: VERIFIED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2013-08-28 14:26 UTC by alf
Modified: 2013-09-26 19:51 UTC (History)
Description alf 2013-08-28 14:26:02 UTC
Elasticsearch provides a "More Like This" query, which - given some initial text - extracts the key terms and uses those to build a new query, returning the documents that best match those terms.

If this were available in MediaWiki's search API, it would allow the index to be queried by example. This can be useful for finding the Wikipedia articles that are most similar to a starting document (e.g. "Wikipedia articles related to this page", shown alongside a news story), and also for automatically categorising documents (using the categories that have been attached to the most similar Wikipedia articles).
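
The categorisation idea could be sketched roughly as follows. This is only an illustration: the function name `suggest_categories`, the shape of the result records, and the sample articles are all made up for the example and do not come from any MediaWiki or Elasticsearch API.

```python
from collections import Counter


def suggest_categories(similar_articles, top_n=3):
    """Rank categories by how many of the similar articles carry them.

    `similar_articles` is a hypothetical list of dicts, each with a
    "categories" list, standing in for the results of a similarity search.
    """
    counts = Counter()
    for article in similar_articles:
        # Count each category at most once per article.
        counts.update(set(article["categories"]))
    # Highest count first; ties broken alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:top_n]]


# Made-up sample data, purely for illustration.
hits = [
    {"title": "Albert Einstein", "categories": ["Physicists", "Nobel laureates"]},
    {"title": "Niels Bohr", "categories": ["Physicists", "Nobel laureates"]},
    {"title": "Isaac Newton", "categories": ["Physicists", "Mathematicians"]},
]
print(suggest_categories(hits, top_n=2))
```

A simple vote like this ignores the similarity score of each hit; weighting votes by score would be a natural refinement.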

An example query: https://gist.github.com/hubgit/6365895

Most of those parameters (fields to query, fields to return, number of items to return, query text) can be passed through as query parameters, and the others (min_term_freq, max_query_terms, percent_terms_to_match) can be hard-coded to values appropriate for the index.

It might be appropriate to use POST for the query, as the query text can be a whole document.
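
A minimal sketch of such a request, assuming the more_like_this query syntax Elasticsearch offered around the time of this report (later releases may name these parameters differently). The function name, the endpoint URL, the field names and the hard-coded values below are placeholders for illustration, not values taken from the gist or from CirrusSearch. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_mlt_request(es_url, like_text, query_fields, return_fields, size=10):
    """Build (but do not send) a POST request carrying a more_like_this query."""
    body = {
        "size": size,                 # number of items to return
        "fields": return_fields,      # fields to return for each hit
        "query": {
            "more_like_this": {
                "fields": query_fields,   # fields to query
                "like_text": like_text,   # the query text, possibly a whole document
                # Placeholder values; in practice these would be hard-coded
                # to whatever suits the index.
                "min_term_freq": 1,
                "max_query_terms": 25,
                "percent_terms_to_match": 0.3,
            }
        },
    }
    data = json.dumps(body).encode("utf-8")
    # POST keeps a long query text out of the URL.
    return urllib.request.Request(
        es_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and field names.
request = build_mlt_request(
    "http://localhost:9200/wiki/_search",
    "Text of a news story about particle physics...",
    query_fields=["text"],
    return_fields=["title"],
)
print(request.get_method(), request.full_url)
```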
Comment 1 alf 2013-08-28 14:27:20 UTC
Elasticsearch more_like_this query documentation: http://www.elasticsearch.org/guide/reference/query-dsl/mlt-query/
Comment 2 Nik Everett 2013-09-11 16:22:33 UTC
Implementation: https://gerrit.wikimedia.org/r/#/c/83807/
Tests coming.
Comment 3 Gerrit Notification Bot 2013-09-11 17:05:55 UTC
Change 83819 had a related patch set uploaded by Manybubbles:
Tests for morelike:.

https://gerrit.wikimedia.org/r/83819
Comment 4 Nik Everett 2013-09-11 17:08:44 UTC
Gerrit Notification Bot just added my integration tests to the bug as well.


What I've implemented is that users can search for:
morelike:<article name>
and we run an mlt search against that article's text.
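
For illustration, a morelike: search is just an ordinary search string, so it could presumably be passed through the MediaWiki search API like any other query. The sketch below only builds the URL; the helper name `morelike_search_url` and the article title are made up, and whether a given wiki had the feature enabled at the time is an assumption.

```python
from urllib.parse import urlencode


def morelike_search_url(api_url, title, limit=10):
    """Build a search API URL that asks for articles similar to `title`."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": "morelike:" + title,  # the search keyword described above
        "srlimit": limit,
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


# Example title chosen arbitrarily.
url = morelike_search_url("https://en.wikisource.org/w/api.php", "Hamlet")
print(url)
```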
Comment 5 Gerrit Notification Bot 2013-09-16 15:58:21 UTC
Change 83819 merged by jenkins-bot:
Tests for morelike:.

https://gerrit.wikimedia.org/r/83819
Comment 6 Nik Everett 2013-09-26 19:51:54 UTC
Verified on enwikisource.  It is slow!  12 second searches.  It is fun though.
