Last modified: 2014-10-08 13:03:29 UTC
Several languages, such as Arabic, Persian, Urdu, and Kurdish, share a common script, and some of their letters have similar glyphs but different Unicode code points. For example:

For ک (kaf):
  ك Arabic U+0643
  ڪ Urdu U+06AA
  ﻙ Pushtu U+FED9
  ﻚ Uyghur U+FEDA
  ک Persian U+06A9

For ی (ya):
  ی Persian U+06CC
  ي Arabic U+064A
  ى Urdu U+0649
  ۍ Pushtu U+06CD
  ې Uyghur U+06D0

For ه (heh):
  ہ Pushtu U+06C1
  ە Kurdish U+06D5
  ه Persian U+0647

These characters have different code points and are typed with different keyboard layouts. Many users do not have a Persian or Urdu keyboard available by default in their OS (for example Windows XP, older Android versions, iOS). As a result, when they search for an article in the Wikipedia search box they cannot find it, even though the article exists under a title written with the local characters.

For example, if you search fa.wikipedia for ويليام شكسپير (typed with the Arabic ي and ك), you cannot find the article, because its Farsi title is ویلیام شکسپیر (written with the Persian ی and ک).

For Farsi, please make the search tool treat these characters as equivalent:
  U+064A, U+0649, U+06CD, U+06D0, or U+06CC > U+06CC
  U+0643, U+06AA, U+FED9, or U+FEDA > U+06A9
  U+06C1 or U+06D5 > U+0647
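The mapping requested for Farsi could be sketched as a simple character-folding table applied to both the search query and the stored title before they are compared. This is only an illustration in Python, not how MediaWiki or its search backend implements normalization; the names `FA_FOLD`, `normalize_fa`, and `titles_match` are hypothetical.

```python
# Hypothetical folding table built from the mapping requested in this report.
# Keys are the variant code points, values are the Persian target code points.
FA_FOLD = {
    # ya variants -> U+06CC
    0x064A: 0x06CC,
    0x0649: 0x06CC,
    0x06CD: 0x06CC,
    0x06D0: 0x06CC,
    # kaf variants -> U+06A9
    0x0643: 0x06A9,
    0x06AA: 0x06A9,
    0xFED9: 0x06A9,
    0xFEDA: 0x06A9,
    # heh variants -> U+0647
    0x06C1: 0x0647,
    0x06D5: 0x0647,
}


def normalize_fa(text: str) -> str:
    """Fold the variant letters to their Persian equivalents."""
    return text.translate(FA_FOLD)


def titles_match(query: str, title: str) -> bool:
    """Compare a query and a title after folding both sides."""
    return normalize_fa(query) == normalize_fa(title)


# The example from this report, written with escapes to avoid encoding issues:
# the query uses Arabic yeh (U+064A) and kaf (U+0643),
# the title uses Persian yeh (U+06CC) and keheh (U+06A9).
query = "\u0648\u064A\u0644\u064A\u0627\u0645 \u0634\u0643\u0633\u067E\u064A\u0631"
title = "\u0648\u06CC\u0644\u06CC\u0627\u0645 \u0634\u06A9\u0633\u067E\u06CC\u0631"

print(query == title)              # False: the raw strings differ
print(titles_match(query, title))  # True: they match after folding
```

Folding both the query and the title means existing titles do not need to be renamed; only the comparison changes.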
Yes, we have the same problem on ckb.wikipedia. This would be useful.
Maybe fa.wikipedia and ckb.wikipedia need some normalization like https://github.com/wikimedia/mediawiki-core/blob/master/languages/classes/LanguageAr.php
and https://github.com/wikimedia/mediawiki-core/blob/master/maintenance/language/generateNormalizerDataAr.php
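As a rough illustration of why a language-specific table may be needed, the sketch below uses Python's standard `unicodedata` module (not the PHP classes linked above) to show that generic Unicode compatibility normalization (NFKC) folds the Arabic presentation forms back to the base letter but leaves the Arabic and Persian letters distinct. The helper name `nfkc` is hypothetical.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


# Presentation forms of kaf are folded to the base Arabic kaf U+0643.
print(nfkc("\uFED9") == "\u0643")  # True
print(nfkc("\uFEDA") == "\u0643")  # True

# Arabic kaf and yeh are NOT folded to their Persian counterparts,
# so a per-language mapping is still required for that step.
print(nfkc("\u0643") == "\u06A9")  # False
print(nfkc("\u064A") == "\u06CC")  # False
```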
Is this request about CirrusSearch or about LuceneSearch (deprecated)?
(In reply to Andre Klapper from comment #4)
> Is this request about CirrusSearch or about LuceneSearch (deprecated)?

We need normalization for the search box at the top of every page.