Last modified: 2014-11-16 03:06:37 UTC
Created attachment 17139: Showing Tamil consonant sequences. Reference page: https://ta.wikipedia.org/s/4om On the page above, 'ஜ' follows 'ச'. Characters such as 'ஸ', 'ஷ', 'ஜ', and 'ஹ' are called grantha characters and are not part of the basic Tamil alphabet. See https://en.wikipedia.org/wiki/Tamil_script#Basic_consonants By convention, they are placed towards the end (i.e. after 'ன'). The first column in the attached image shows the correct sequence. (Image source: Naga. Ilangovan)
I think we need to add collation support for Tamil (not sure whether this needs anything from upstream libicu), and look at getting the category collation updated on tawiki.
ICU appears to support Tamil (http://bugs.icu-project.org/trac/browser/icu/trunk/source/data/coll/ta.txt), so we only need to add it to the list of supported collations and perhaps adjust first-letter generation. (And then confirm that it actually sorts the words correctly.)
Thanks, Sam Reed and Bartosz Dziewoński. Yes, the consonant sequence in http://bugs.icu-project.org/trac/browser/icu/trunk/source/data/coll/ta.txt is correct. We just need to validate the overall sequence of vowels, consonants, and compounds.
The following two related issues should also be covered by this bug report. 1) The letter ஃ should sort after all the vowels. Currently, it is positioned first: ஃ, அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ. The right order is அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ, ஃ. That is, when all the Tamil letters are sorted, ஃ should appear after ஔ and before க். 2) A pure consonant letter should sort before its compound forms. If we sort the letters (ம, ம், மா), the right result is (ம், ம, மா). Currently the order is (ம, மா, ம்), which is wrong. The impact can be seen by sorting a few strings. Given the set of 4 strings (கணமொழி, கணமூலி, கணம்புல், கணம்), the current sort order results in (கணமூலி, கணமொழி, கணம், கணம்புல்). This is wrong; the right order is (கணம், கணம்புல், கணமூலி, கணமொழி). (These 4 strings are proper Tamil words according to the Tamil lexicon at http://dsalsrv02.uchicago.edu/dictionaries/tamil-lex/ .)
(In reply to elan from comment #4) Yes, the above sequence is the correct order.
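To make the expected ordering concrete, here is a minimal Python sketch of a sort key that follows the rules described in this report: vowels first, then ஃ, then the basic consonants, then the grantha letters after ன, with a pure consonant (pulli form) sorting before its compound forms. This is not the ICU tailoring and not what MediaWiki uses; `tamil_sort_key` and the constant names are hypothetical, the relative order among the grantha letters is an assumption, and placing non-Tamil characters after Tamil ones is also an assumption. It is only meant as a reference for checking collation output against the examples above.

```python
import unicodedata

# Independent vowels in traditional order; aytham (ஃ) comes right after ஔ.
VOWELS = unicodedata.normalize("NFC", "அஆஇஈஉஊஎஏஐஒஓஔ")
AYTHAM = "ஃ"
# Basic consonants in traditional order, then grantha letters after ன.
# ASSUMPTION: the relative order among the grantha letters themselves.
CONSONANTS = "கஙசஞடணதநபமயரலவழளறன" + "ஜஶஷஸஹ"
PULLI = "\u0bcd"
# Vowel signs, in the same order as the independent vowels ஆ .. ஔ.
VOWEL_SIGNS = "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc"

BASE_RANK = {ch: i for i, ch in enumerate(VOWELS + AYTHAM + CONSONANTS)}
CONSONANT_SET = set(CONSONANTS)
SIGN_SET = set(PULLI + VOWEL_SIGNS)
# Pure consonant (pulli) first, then the bare letter, then the vowel signs.
SIGN_RANK = {PULLI: 0, "": 1}
SIGN_RANK.update({sign: i + 2 for i, sign in enumerate(VOWEL_SIGNS)})


def tamil_sort_key(text):
    """Return a list of tuples usable as a sort key (hypothetical helper)."""
    # NFC composes two-part vowel signs (e.g. U+0BC6 U+0BBE -> U+0BCA).
    text = unicodedata.normalize("NFC", text)
    key = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANT_SET and nxt in SIGN_SET:
            # Consonant followed by pulli or a vowel sign: one sorting unit.
            key.append((0, BASE_RANK[ch], SIGN_RANK[nxt]))
            i += 2
        elif ch in BASE_RANK:
            # Vowel, aytham, or a consonant with no sign attached.
            key.append((0, BASE_RANK[ch], SIGN_RANK[""]))
            i += 1
        else:
            # ASSUMPTION: anything else sorts after Tamil letters, by code point.
            key.append((1, ord(ch), 0))
            i += 1
    return key


words = ["கணமொழி", "கணமூலி", "கணம்புல்", "கணம்"]
by_code_point = sorted(words)                   # plain code point comparison
by_tamil_key = sorted(words, key=tamil_sort_key)
```

Plain code point comparison places ஃ (U+0B83) before அ (U+0B85) and the pulli (U+0BCD) after every vowel sign, which matches the incorrect orders quoted in the report; the key above reverses both.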