Last modified: 2014-08-13 19:53:45 UTC
Split from bug 54022: for languages other than the 30 currently supported, rather than using the default analyzer bare we should probably use stopwords calculated automatically while we wait for custom ones to be made. It seems the cutoff_frequency setting and the common_terms query may be usable for this purpose. I'd say this is currently low priority, but it should probably be done before expanding Elasticsearch beyond the ~30 supported languages.
I'm not sure this should be a hard requirement before expanding beyond the ~30 languages with built-in stop words. I certainly agree we should do it, though.
Hmm, I wonder: http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/
I believe that is what nemo was referring to. The problem (right now) is that we use query string queries rather than term queries. For what we do, that makes a lot of sense. However, query string queries don't play nicely with common terms queries yet. They could possibly be made to, but I'm not sure about that yet. It'd probably make more sense to make this change in Elasticsearch and for us to just flip the switch to turn it on.
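For reference, a common terms query of the kind discussed above looks roughly like the sketch below. The helper name, the field name "text" and the 0.001 cutoff are illustrative assumptions, not values taken from our configuration.

```python
import json


def common_terms_query(field, text, cutoff_frequency=0.001):
    """Build the request body for an Elasticsearch common terms query.

    Terms whose frequency is above cutoff_frequency are treated as
    "common" (stopword-like) and only affect scoring of documents that
    already match the rarer terms. The default cutoff is an assumption
    chosen for illustration.
    """
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    "cutoff_frequency": cutoff_frequency,
                    # Require all low-frequency (rare) terms to match.
                    "low_freq_operator": "and",
                }
            }
        }
    }


if __name__ == "__main__":
    # "text" is a hypothetical field name.
    print(json.dumps(common_terms_query("text", "the quick brown fox"), indent=2))
```

Note that this is a standalone query type, which is exactly why it does not combine directly with the query string queries we currently send.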
I don't know anything about the implementation details, but yes, from the few hints I've gathered, that would seem the most elegant way to handle it. However, from what I understand, it may also be viable to automatically generate "standard" stopword lists for each language.
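As a rough illustration of what automatically generating a stopword list could mean, here is a minimal sketch that picks terms by document frequency. The function name, the naive \w+ tokenization and the 0.5 threshold are all assumptions made for the example; a real list would need language-aware tokenization and a much larger corpus.

```python
import re
from collections import Counter


def derive_stopwords(documents, cutoff_frequency=0.5):
    """Return the terms that occur in more than cutoff_frequency of the documents.

    This is a hypothetical helper, not an Elasticsearch API.
    """
    total = len(documents)
    if total == 0:
        return set()

    doc_freq = Counter()
    for text in documents:
        # Count each term at most once per document (document frequency).
        doc_freq.update(set(re.findall(r"\w+", text.lower())))

    return {term for term, n in doc_freq.items() if n / total > cutoff_frequency}


if __name__ == "__main__":
    sample = ["the cat sat", "the dog ran", "a bird and the fish"]
    print(derive_stopwords(sample))
```

The resulting set could then be fed to a stop token filter for a language that has no built-in list.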
https://github.com/elasticsearch/elasticsearch/pull/5005