Last modified: 2014-11-17 11:10:53 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T65122, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 63122 - improve detectTofu algorithm so it can detect replacement characters in fixed-width glyphs


Summary:	improve detectTofu algorithm so it can detect replacement characters in fixed...

Status:	PATCH_TO_REVIEW

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	UniversalLanguageSelector (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal enhancement (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	patch, patch-need-review

Depends on:
Blocks:

Reported:	2014-03-26 17:55 UTC by D Chan
Modified:	2014-11-17 11:10 UTC (History)
CC List:	15 users (show)

See Also:	31791
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Patch for the detect chinese tofu function (1.52 KB, patch) 2014-03-27 12:08 UTC, Xiangquan Xiao	Details
A simple test page (1.63 KB, text/html) 2014-03-27 12:10 UTC, Xiangquan Xiao	Details

Description D Chan 2014-03-26 17:55:03 UTC

The detectTofu function finds glyphs which are missing from a font (and so are replaced by a replacement character or "tofu").

The current algorithm works as follows:

1. Measure the rendered width/height of each character in a test string.
2. Compare to a character that is known to be replaced.
3. If each character is the same size (including the replacement), then conclude that all characters are missing glyphs.
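
The size comparison described in the steps above can be sketched roughly as follows. This is an illustration only, not the actual ULS code: the names allSameSizeAsReplacement, measure and measureInBrowser are hypothetical, and the use of U+FFFF as a character "known to be replaced" and the 72px sans-serif fixture are assumptions.

```javascript
// Returns true when every character in `text` renders at exactly the same
// size as the replacement glyph, i.e. the text *may* consist of tofu.
// `measure` is a hypothetical callback returning { width, height } for one
// rendered character; injecting it keeps the comparison logic DOM-free.
function allSameSizeAsReplacement( text, replacementChar, measure ) {
    var chars = Array.from( text ); // split by code point, not UTF-16 unit
    if ( chars.length === 0 ) {
        return false;
    }
    var reference = measure( replacementChar );
    return chars.every( function ( ch ) {
        var size = measure( ch );
        return size.width === reference.width &&
            size.height === reference.height;
    } );
}

// One possible browser-side `measure` (assumption: a hidden 72px sans-serif
// span is an adequate fixture). Only usable where `document` exists.
function measureInBrowser( ch ) {
    var span = document.createElement( 'span' );
    span.style.fontFamily = 'sans-serif';
    span.style.fontSize = '72px';
    span.style.position = 'absolute';
    span.style.visibility = 'hidden';
    span.textContent = ch;
    document.body.appendChild( span );
    var size = { width: span.offsetWidth, height: span.offsetHeight };
    document.body.removeChild( span );
    return size;
}
```

In a browser this might be called as allSameSizeAsReplacement( text, '\uFFFF', measureInBrowser ).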

This works very well for many languages. However, it fails for Chinese, because typically all Han character glyphs in a font are the same size as the replacement character glyph. Also, there is no such thing as a 'complete Han font': there are always missing characters.

Therefore, we should implement a more sophisticated approach:

1. Start with the above algorithm for speed.
2. Render a character to an HTML canvas.
3. Compare its bitmap to the bitmap of the replacement character glyph.

This will allow us to detect exactly which characters are missing, regardless of width/height.
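
A minimal sketch of the bitmap comparison, under stated assumptions: the function names renderGlyph, sameBitmap and isTofu are hypothetical, the canvas and font sizes are arbitrary, and the caller supplies a character it knows to be missing from the font. It also assumes the browser draws the same glyph for every missing character, which may not hold everywhere.

```javascript
// Draw a single character on an off-screen canvas and return its RGBA bytes.
// Browser only: relies on the standard Canvas 2D API.
function renderGlyph( ch, fontFamily ) {
    var size = 32; // illustrative font size in pixels
    var canvas = document.createElement( 'canvas' );
    canvas.width = size * 2;
    canvas.height = size * 2;
    var ctx = canvas.getContext( '2d' );
    ctx.font = size + 'px ' + fontFamily;
    ctx.textBaseline = 'top';
    ctx.fillText( ch, 0, 0 );
    return ctx.getImageData( 0, 0, canvas.width, canvas.height ).data;
}

// Two bitmaps match when they have equal length and identical bytes.
function sameBitmap( a, b ) {
    if ( a.length !== b.length ) {
        return false;
    }
    for ( var i = 0; i < a.length; i++ ) {
        if ( a[ i ] !== b[ i ] ) {
            return false;
        }
    }
    return true;
}

// A character is treated as tofu when it renders pixel-for-pixel like the
// known-missing character in the same font.
function isTofu( ch, fontFamily, replacementChar ) {
    return sameBitmap(
        renderGlyph( ch, fontFamily ),
        renderGlyph( replacementChar, fontFamily )
    );
}
```

Because each character is checked individually, this approach can report exactly which characters in a string are missing, rather than only whether the whole string is.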

Comment 1 Quim Gil 2014-03-26 18:00:01 UTC

Please don't take this bug unless you are a GSoC student working on Bug 31791 - Add web fonts for Chinese scripts. Thank you.

Comment 2 Xiangquan Xiao 2014-03-27 12:08:50 UTC

Created attachment 14941 [details]
Patch for the detect chinese tofu function

I hope this works. I have just finished the function, but I don't yet know how to integrate it with ULS.

Comment 3 Xiangquan Xiao 2014-03-27 12:10:30 UTC

Created attachment 14942 [details]
A simple test page

It will alert once for a tofu character and once for a non-tofu character.

Comment 4 Andre Klapper 2014-03-27 12:57:21 UTC

(In reply to Xiangquan Xiao from comment #2)
> Created attachment 14941 [details]
> Patch for the detect chinese tofu function

Thanks for your patch!
You are welcome to use Developer access
  https://www.mediawiki.org/wiki/Developer_access
to submit this as a Git branch directly into Gerrit:
  https://www.mediawiki.org/wiki/Git/Tutorial

Putting your branch in Git makes it easier to review it quickly. If you don't want to set up Git/Gerrit, you can also use https://tools.wmflabs.org/gerrit-patch-uploader/
Thanks again! We appreciate your contribution.

Comment 5 Xiangquan Xiao 2014-03-27 14:50:15 UTC

(In reply to Andre Klapper from comment #4)
> 
> Putting your branch in Git makes it easier to review it quickly. If you
> don't want to set up Git/Gerrit, you can also use
> https://tools.wmflabs.org/gerrit-patch-uploader/
> Thanks again! We appreciate your contribution.

Thanks for the information. I've set up gerrit by following the tips.

Actually, it's an incomplete fix, so I'll just leave the test page there to show how it works, as the microtask for my GSoC application.

A complete fix will be submitted soon using Gerrit.

Comment 6 D Chan 2014-03-27 18:00:06 UTC

Thanks Xiangquan, that's an extremely good start!

When you submit to gerrit, I'll post more detailed comments there.

Be sure to put 'Bug: 63122' (without quotes) in your commit message, on a line of its own, immediately above the change ID, with no extra whitespace.
Then gerrit will post comments automatically to this bug.

Comment 7 Xiangquan Xiao 2014-03-28 16:15:28 UTC

(In reply to David Chan from comment #6)
> Thanks Xiangquan, that's an extremely good start!
> 
> When you submit to gerrit, I'll post more detailed comments there.
> 
> Be sure to put 'Bug: 63122' (without quotes) in your commit message, on a
> line of its own, immediately above the change ID, with no extra whitespace.
> Then gerrit will post comments automatically to this bug.

Hi, I'd like to clarify something.

1. Do we need a separate function, like detectChineseTofu(), as in my previous patch? If so, in which situation would it be called?

2. Or is this an improvement to the existing detectTofu() that makes it work for Chinese? If so, may I simply replace the old solution, since the new one (comparing images) should work for almost all languages? Although it is slower than comparing only widths and heights, a unified solution looks much simpler.

Comment 8 D Chan 2014-03-28 16:35:52 UTC

(2) is correct. Your method is more precise and works for more languages. However the old method is faster[*], and completely reliable if it returns false. Therefore we should do the following pseudo-code:

function detectTofu ( text ) {
    maybeTofu = <old technique>;
    if ( maybeTofu ) {
        isTofu = <new technique>;
    } else {
        isTofu = false;
    }
    return isTofu;
}

[*] I *presume* the old method is faster, but I have not actually tested this. Feel free to do so and to post actual numbers here!
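
The pseudo-code above could be made runnable along the following lines. This is only a sketch: makeDetectTofu, sizeCheck and bitmapCheck are hypothetical names, with sizeCheck standing in for the old width/height technique and bitmapCheck for the new canvas technique.

```javascript
// Build a detectTofu function from two checks that each take the text and
// return a boolean. The cheap size check runs first; the slower bitmap
// check runs only when the size check cannot rule tofu out.
function makeDetectTofu( sizeCheck, bitmapCheck ) {
    return function detectTofu( text ) {
        var maybeTofu = sizeCheck( text );
        if ( !maybeTofu ) {
            // A negative result from the old technique is trusted as-is.
            return false;
        }
        // A positive result is only a hint, so confirm it precisely.
        return bitmapCheck( text );
    };
}
```

With this shape, the expensive check is skipped entirely for text that the size comparison already clears, which is the common case for most languages.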

Comment 9 Xiangquan Xiao 2014-03-30 15:57:30 UTC

(In reply to David Chan from comment #8)
> (2) is correct. Your method is more precise and works for more languages.
> However the old method is faster[*], and completely reliable if it returns
> false. Therefore we should do the following pseudo-code:
> 
> function detectTofu ( text ) {
>     maybeTofu = <old technique>;
>     if ( maybeTofu ) {
>         isTofu = <new technique>;
>     } else {
>         isTofu = false;
>     }
>     return isTofu;
> }

Hi, what about a sentence that contains only one tofu character, which is common in languages like Chinese?
Should detectTofu(text) return true in that situation?

BTW, I'll test the performance of both techniques and post the results here :)

Comment 10 Gerrit Notification Bot 2014-03-31 04:47:54 UTC

Change 122277 had a related patch set uploaded by Xiaoxiangquan:
uls: Improve detectTofu algorithm to detect fixed-width glyphs

https://gerrit.wikimedia.org/r/122277

Comment 11 Xiangquan Xiao 2014-03-31 04:54:42 UTC

(In reply to Gerrit Notification Bot from comment #10)
> Change 122277 had a related patch set uploaded by Xiaoxiangquan:
> uls: Improve detectTofu algorithm to detect fixed-width glyphs
> 
> https://gerrit.wikimedia.org/r/122277

Sorry, I haven't set up a proper testing environment yet (I'm currently trying Vagrant), so this patch is not well tested. I have tried to make it bug-free.
