Last modified: 2014-04-15 16:43:51 UTC
I'm not sure of the cause of this one: old search has three results and Cirrus has two but should have all three.
Old: https://de.wikipedia.org/wiki/Special:Search?profile=advanced&search=stolpersteine+prefix%3APortal+Diskussion%3ANationalsozialismus%2F&fulltext=Search&ns0=1&ns4=1&ns10=1&ns12=1&redirs=1&profile=advanced
Cirrus: https://de.wikipedia.org/wiki/Special:Search?profile=advanced&search=stolpersteine+prefix%3APortal+Diskussion%3ANationalsozialismus%2F&fulltext=Search&ns0=1&ns4=1&ns10=1&ns12=1&redirs=1&profile=advanced&srbackend=CirrusSearch
Filing high because I'm not sure what is up with it.
Added some See Also bugs which might be the cause. Or might not.
Not quite sure what is going on, but this actually works in dev and not in production. Neither enwiki nor dewiki splits on the ":" but my dev machines do.
http://localhost:1234/dewiki_content/_analyze?analyzer=text&text=Kategorie:Stolpersteine
{
  "tokens": [
    {
      "token": "kategorie:stolperstein",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
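To illustrate why the missing split on ":" would hide a result, here is a minimal Python sketch. It is not the Elasticsearch analyzer: the two tokenizer functions and the `matches` helper are hypothetical stand-ins, and stemming is left out. It only shows that a query for "stolpersteine" cannot match when "Kategorie:Stolpersteine" is indexed as a single token.

```python
import re

def tokenize_keep_colon(text):
    """Hypothetical stand-in: words joined by ':' stay together as one token."""
    return [t.lower() for t in re.findall(r"\w+(?::\w+)*", text)]

def tokenize_split_colon(text):
    """Hypothetical stand-in: ':' is treated as a token boundary."""
    return [t.lower() for t in re.findall(r"\w+", text)]

def matches(query, tokens):
    """A query term matches only if it equals an indexed token exactly."""
    return query.lower() in tokens

text = "Kategorie:Stolpersteine"
unsplit = tokenize_keep_colon(text)   # one token containing the colon
split = tokenize_split_colon(text)    # two separate tokens

print(matches("stolpersteine", unsplit))
print(matches("stolpersteine", split))
```

With the unsplit token the query term is never found on its own, which is consistent with a page going missing from the Cirrus results.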
Ah, what is saving me in dev is the $wgCirrusSearchUseAggressiveSplitting setting, which _is_ enabled on mediawiki.org but only works in English. The problem with enabling it everywhere is that it only works in English right now, so it might make things harder to find. Let me see what I can do about that.
Stalling this for a moment while I wait on input from Dan and Chad. The question is whether to get aggressive splitting working everywhere or to use a smaller fix that handles just colons. I'd like to unify on aggressive splitting everywhere to make regression testing easier and to avoid the confusion of some environments having it and some not.
I've gotten input: we should push aggressive splitting everywhere we can sensibly do it. I've filed https://github.com/elasticsearch/elasticsearch/issues/5648 upstream so we can more easily edit the analyzers built into Elasticsearch. Right now editing them requires rebuilding them as "custom" analyzers by hand, which is error-prone. The issue would let us instruct Elasticsearch to rebuild them as custom analyzers, and then we could make incremental changes to them. We don't actually need the issue closed upstream to work on this here, but we will need it for a few languages because some of the language analyzers can't actually be rebuilt as custom analyzers: Persian, Thai, and German, I believe.
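As a rough sketch of what rebuilding an analyzer as "custom" by hand could look like, the Python below builds an index settings body with an extra splitting filter in front of the usual filter chain. The assumptions are mine: the filter name `aggressive_splitting` and the function name are hypothetical, the use of Elasticsearch's `word_delimiter` token filter is a guess at how the splitting would be done, and the language-specific stop word and stemmer filters are deliberately left out.

```python
import json

def custom_text_analyzer(base_filters):
    """Build settings for a hand-rebuilt 'text' analyzer (illustrative only)."""
    return {
        "analysis": {
            "filter": {
                # Hypothetical filter name; word_delimiter splits tokens on
                # non-alphanumeric characters such as ':'.
                "aggressive_splitting": {"type": "word_delimiter"},
            },
            "analyzer": {
                "text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Split first, then apply the rest of the chain.
                    "filter": ["aggressive_splitting"] + list(base_filters),
                }
            },
        }
    }

# Only 'lowercase' is shown; a real language analyzer would also need its
# stop word and stemmer filters re-declared here, which is the error-prone part.
settings = custom_text_analyzer(["lowercase"])
print(json.dumps(settings, indent=2))
```

Every built-in filter has to be re-listed by hand in a body like this, which is why having Elasticsearch expand the built-in analyzers itself would make incremental changes much safer.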