Last modified: 2013-09-26 19:51:54 UTC
Elasticsearch provides a "More Like This" query which, given some initial text, extracts the key terms and uses them to build a new query, returning the documents that best match those terms. If this were available in MediaWiki's search API, it would allow the index to be queried by example.

This can be useful for finding the Wikipedia articles that are most similar to a starting document (e.g. "Wikipedia articles related to this page", alongside a news story), and also for automatically categorising documents (using the categories that have been attached to the most similar Wikipedia articles).

An example query: https://gist.github.com/hubgit/6365895

Most of those parameters (fields to query, fields to return, number of items to return, query text) can be passed through as query parameters, and the others (min_term_freq, max_query_terms, percent_terms_to_match) can be hard-coded to values appropriate for the index. It might be appropriate to use POST for the query, as the query text can be a whole document.
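
To make the split between pass-through and hard-coded parameters concrete, here is a minimal Python sketch that only assembles a request body and sends nothing. The function name build_mlt_body, the field names "text" and "title", and the tuning values are assumptions for illustration, not values taken from the gist or from any MediaWiki code. The like_text key follows my reading of the Elasticsearch documentation linked in this report.

```python
import json

# Hypothetical hard-coded tuning values; the right numbers depend on the index.
HARD_CODED = {
    "min_term_freq": 1,
    "max_query_terms": 25,
    "percent_terms_to_match": 0.3,
}


def build_mlt_body(like_text, query_fields, return_fields, size):
    """Assemble a more_like_this request body from caller-supplied parameters."""
    mlt = {"fields": list(query_fields), "like_text": like_text}
    # The tuning parameters are not exposed to the caller.
    mlt.update(HARD_CODED)
    return {
        "query": {"more_like_this": mlt},
        "fields": list(return_fields),  # fields to return for each hit
        "size": size,                   # number of items to return
    }


if __name__ == "__main__":
    body = build_mlt_body("Text of a news story", ["text"], ["title"], 5)
    # The body would be sent with POST, since like_text can be a whole document.
    print(json.dumps(body, indent=2))
```

Keeping the tuning values in one constant means they can be adjusted per index without changing the public query parameters.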
Elasticsearch more_like_this query documentation: http://www.elasticsearch.org/guide/reference/query-dsl/mlt-query/
Implementation: https://gerrit.wikimedia.org/r/#/c/83807/ Tests coming.
Change 83819 had a related patch set uploaded by Manybubbles: Tests for morelike:. https://gerrit.wikimedia.org/r/83819
Gerrit Notification Bot just added my integration tests to the bug as well. What I've implemented is that users can search for morelike:<article name>, and we run an mlt (More Like This) search against that article's text.
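
The morelike: behaviour described in this comment can be sketched roughly as follows. This is an illustrative Python outline, not the code in the Gerrit change: parse_morelike, build_morelike_query, the fetch_article_text callback, and the "text" field name are all hypothetical.

```python
PREFIX = "morelike:"


def parse_morelike(search_text):
    """Return the article name if the search uses the morelike: prefix, else None."""
    if not search_text.startswith(PREFIX):
        return None
    title = search_text[len(PREFIX):].strip()
    return title or None


def build_morelike_query(search_text, fetch_article_text):
    """Build an mlt query from the named article's text.

    fetch_article_text is a hypothetical callback that returns the article's
    text, or None if no such article exists.
    """
    title = parse_morelike(search_text)
    if title is None:
        return None  # not a morelike: search; fall back to normal handling
    text = fetch_article_text(title)
    if text is None:
        return None  # unknown article, nothing to compare against
    return {"query": {"more_like_this": {"fields": ["text"], "like_text": text}}}


if __name__ == "__main__":
    # A stand-in lookup table instead of a real wiki.
    articles = {"Moby Dick": "Call me Ishmael."}
    print(build_morelike_query("morelike:Moby Dick", articles.get))
```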
Change 83819 merged by jenkins-bot: Tests for morelike:. https://gerrit.wikimedia.org/r/83819
Verified on enwikisource. It is slow! 12-second searches. It is fun, though.