Last modified: 2014-11-17 11:10:53 UTC
The detectTofu function finds glyphs which are missing from a font (and so are replaced by a replacement character or "tofu"). The current algorithm works as follows:

1. Measure the rendered width/height of each character in a test string.
2. Compare to a character that is known to be replaced.
3. If each character is the same size (including the replacement), then conclude that all characters are missing glyphs.

This works very well for many languages. However, it fails for Chinese, because typically all Han character glyphs in a font are the same size as the replacement character glyph. Also, there is no such thing as a 'complete Han font': there are always missing characters.

Therefore, we should implement a more sophisticated approach:

1. Start with the above algorithm for speed.
2. Render a character to an HTML canvas.
3. Compare its bitmap to the bitmap of the replacement character glyph.

This will allow us to detect exactly which characters are missing, regardless of width/height.
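The bitmap comparison in steps 2 and 3 of the proposed approach could be sketched roughly as follows. This is only an illustration, not the ULS implementation: the names bitmapsEqual, renderWithCanvas and findTofuChars are hypothetical, the character known to be replaced is passed in as a parameter because this report does not name one, and the renderer is injected so the comparison logic can be exercised without a browser.

```javascript
// Hypothetical helper names; none of these are taken from the ULS source.

// Compare two pixel buffers (for example ImageData.data) byte by byte.
function bitmapsEqual(a, b) {
  if (a.length !== b.length) {
    return false;
  }
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) {
      return false;
    }
  }
  return true;
}

// Browser-only renderer (assumption: a 2D canvas context is available).
// Draws one character and returns its RGBA pixel data.
function renderWithCanvas(ch, fontFamily, size) {
  const canvas = document.createElement('canvas');
  canvas.width = size * 2;
  canvas.height = size * 2;
  const ctx = canvas.getContext('2d');
  ctx.textBaseline = 'top';
  ctx.font = size + 'px ' + fontFamily;
  ctx.fillText(ch, 0, 0);
  return ctx.getImageData(0, 0, canvas.width, canvas.height).data;
}

// Return the characters of `text` whose bitmap matches the bitmap of a
// character that is known to be replaced. `render` maps a character to
// a pixel buffer; Array.from splits the text by code point.
function findTofuChars(text, render, knownMissingChar) {
  const tofuBitmap = render(knownMissingChar);
  return Array.from(text).filter(function (ch) {
    return bitmapsEqual(render(ch), tofuBitmap);
  });
}
```

In a browser, `render` might be something like `ch => renderWithCanvas(ch, 'SomeFont', 20)`, where the font name and size are placeholders.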
Please don't take this bug unless you are a GSoC student working on Bug 31791 - Add web fonts for Chinese scripts. Thank you.
Created attachment 14941 [details]
Patch for the detect chinese tofu function

I hope it works. I have just finished the function, but I don't yet know how to integrate it with ULS.
Created attachment 14942 [details]
A simple test page

It alerts one tofu character and one non-tofu character.
(In reply to Xiangquan Xiao from comment #2)
> Created attachment 14941 [details]
> Patch for the detect chinese tofu function

Thanks for your patch!

You are welcome to use Developer access https://www.mediawiki.org/wiki/Developer_access to submit this as a Git branch directly into Gerrit: https://www.mediawiki.org/wiki/Git/Tutorial

Putting your branch in Git makes it easier to review it quickly. If you don't want to set up Git/Gerrit, you can also use https://tools.wmflabs.org/gerrit-patch-uploader/

Thanks again! We appreciate your contribution.
(In reply to Andre Klapper from comment #4)
> Putting your branch in Git makes it easier to review it quickly. If you
> don't want to set up Git/Gerrit, you can also use
> https://tools.wmflabs.org/gerrit-patch-uploader/
> Thanks again! We appreciate your contribution.

Thanks for the information. I've set up Gerrit by following the tips. This is actually an incomplete fix, so I have just left the test page there to show how it works, as the microtask for my GSoC application. A complete fix will be submitted soon through Gerrit.
Thanks Xiangquan, that's an extremely good start! When you submit to gerrit, I'll post more detailed comments there. Be sure to put 'Bug: 63122' (without quotes) in your commit message, on a line of its own, immediately above the change ID, with no extra whitespace. Then gerrit will post comments automatically to this bug.
(In reply to David Chan from comment #6)
> Thanks Xiangquan, that's an extremely good start!
>
> When you submit to gerrit, I'll post more detailed comments there.
>
> Be sure to put 'Bug: 63122' (without quotes) in your commit message, on a
> line of its own, immediately above the change ID, with no extra whitespace.
> Then gerrit will post comments automatically to this bug.

Hi, I'd like to clarify something.

1. Do we need a separate function, like detectChineseTofu(), as in my previous patch? If so, in what situation would it be called?
2. Or is this an improvement to the existing detectTofu() that makes it applicable to Chinese? If so, may I simply replace the old solution, since the new one (comparing images) will work for almost all languages? Although it is slower than comparing only widths and heights, a unified solution looks much simpler.
(2) is correct. Your method is more precise and works for more languages. However the old method is faster[*], and completely reliable if it returns false. Therefore we should do the following pseudo-code:

function detectTofu ( text ) {
    maybeTofu = <old technique>;
    if ( maybeTofu ) {
        isTofu = <new technique>;
    } else {
        isTofu = false;
    }
    return isTofu;
}

[*] I *presume* the old method is faster, but I have not actually tested this. Feel free to do so and to post actual numbers here!
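For illustration, the two-stage pseudo-code could be fleshed out along these lines. This is a sketch under assumptions, not the patch itself: `measure` and `render` are hypothetical injected stand-ins for the old (size) and new (bitmap) techniques, `knownMissingChar` is whatever character is known to be replaced, and, like the current algorithm, it reports tofu only when every character in the text matches the replacement glyph.

```javascript
// Hypothetical sketch; the parameter names are assumptions, not the ULS API.

function sameSize(a, b) {
  return a.width === b.width && a.height === b.height;
}

function samePixels(a, b) {
  if (a.length !== b.length) {
    return false;
  }
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) {
      return false;
    }
  }
  return true;
}

// measure(ch) -> { width, height }; render(ch) -> array-like of pixel bytes.
function detectTofu(text, measure, render, knownMissingChar) {
  const chars = Array.from(text);
  if (chars.length === 0) {
    return false;
  }

  // Old technique: cheap size comparison. A "false" here is trusted as final.
  const tofuSize = measure(knownMissingChar);
  const maybeTofu = chars.every(function (ch) {
    return sameSize(measure(ch), tofuSize);
  });
  if (!maybeTofu) {
    return false;
  }

  // New technique: only reached when the sizes were inconclusive.
  const tofuBitmap = render(knownMissingChar);
  return chars.every(function (ch) {
    return samePixels(render(ch), tofuBitmap);
  });
}
```

Because the bitmap step runs only when the size check cannot rule tofu out, texts in scripts with variable-width glyphs never pay for canvas rendering.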
(In reply to David Chan from comment #8)
> (2) is correct. Your method is more precise and works for more languages.
> However the old method is faster[*], and completely reliable if it returns
> false. Therefore we should do the following pseudo-code:
>
> function detectTofu ( text ) {
>     maybeTofu = <old technique>;
>     if ( maybeTofu ) {
>         isTofu = <new technique>;
>     } else {
>         isTofu = false;
>     }
>     return isTofu;
> }

Hi, what about a sentence that contains only one tofu character, which is common in languages like Chinese? detectTofu(text) will return true in that situation. Is that correct?

BTW, I'll test the performance of both techniques and post the results here :)
Change 122277 had a related patch set uploaded by Xiaoxiangquan: uls: Improve detectTofu algorithm to detect fixed-width glyphs https://gerrit.wikimedia.org/r/122277
(In reply to Gerrit Notification Bot from comment #10)
> Change 122277 had a related patch set uploaded by Xiaoxiangquan:
> uls: Improve detectTofu algorithm to detect fixed-width glyphs
>
> https://gerrit.wikimedia.org/r/122277

Sorry, I haven't set up a proper testing environment yet (I'm currently trying Vagrant), so the patch is not well tested. I tried to make it bug-free.