Last modified: 2013-06-13 01:23:47 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T50630, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 48630 - VisualEditor: Data model needs characters, not code points
VisualEditor: Data model needs characters, not code points
Status: RESOLVED FIXED
Product: VisualEditor
Classification: Unclassified
Data Model (Other open bugs)
unspecified
All All
: Normal normal
: VE-deploy-2013-06-13
Assigned To: Ed Sanders
:
Depends on:
Blocks: ve-multi-lingual 48426
  Show dependency treegraph
 
Reported: 2013-05-20 05:55 UTC by D Chan
Modified: 2013-06-13 01:23 UTC (History)
4 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description D Chan 2013-05-20 05:55:47 UTC
At present, the VisualEditor treats UTF-16 code points as if they were synonymous with abstract characters. Here are two cases where this causes bugs:

1) UTF-16 uses a surrogate pair to represent each Unicode character above U+FFFF. For instance, U+282E2 ('elevator' in Cantonese) is a single character represented in Javascript as "\uD860\uDEE2". In a plain textarea, this behaves like a single character from the point of view of the user. However in the VisualEditor, cursoring and backspacing requires two presses; and after cursoring once, any text typed will go in the middle of the surrogate pair, creating invalid UTF-16. (see The Unicode Standard, Version 6.2, Section 3.8, Surrogates).

2) Combining accents can be used in sequences to build up abstract characters. For example, the Javascript string "m\u0300" represents a single abstract character (m with grave accent). In a plain textarea, this behaves like a single character when cursoring, but like two characters when backspacing (so the first backspace just removes the accent). However in the VisualEditor, cursoring requires two presses; and after cursoring once, any typed text will go between the letter and the accent, creating an inappropriate dangling combining accent.

These kinds of issues occur because the DataModel uses Arrays with code point elements, say ['\uD860', '\uDEE2', ..., 'm', '\u0300']). My hunch is that this is slightly too low level, and it should instead use abstract character elements, say ['\uD860\uDEE2', ..., 'm\u0300'], where each element represents a whole character.

A good start would be to abstract out away calls to string.split( '' ) into a single function like this:

  ve.splitCharacters = function ( value ) {
      return value.split( /(?![\uDC00-\uDFFF])/ ); // don't split surrogate pairs
  };

The rest of the codebase should call this function to perform splits, and then not assume that data[i] is a single character. Then we can refine splitCharacters as needed.

Alternatively, since the overwhelming majority of characters will in fact be single code points, perhaps the DataModel structure could "encode" the exceptional multi-code point characters as objects, so that 'typeof data[i] === "string"' can still detect the simple cases.

This sounds like a big change for a small issue, but I think it would avoid problems in the future. With a character representation, you can safely perform useful operations like splicing and truncating without having to check the surrounding context very carefully every time.
Comment 1 D Chan 2013-05-20 06:04:14 UTC
A related question is when invalid UTF-16 is checked and how errors are handled. 

I don't know enough to suggest where in the system the checks should be. But it obviously relates to the parts of ve.dm.Converter and ve.ce.Surface (for paste) which would call this new splitCharacters function. Note that two chunks of invalid UTF-16 can concatenate into one valid but unexpected chunk.
Comment 2 Ed Sanders 2013-05-20 10:47:26 UTC
Ideally merging would be done at the transaction level (e.g. in fixupInsertion), although we would need to extend that method to be allowed to modify the remove part of a transaction, e.g.

Data: ['F','o','o']

Transaction:
  Retain: 3
  Replace:
    Remove: ''
    Insert: '<accent>'

Would become:

Transaction:
  Retain: 2
  Replace:
    Remove: 'o'
    Insert: 'o<accent>'
Comment 3 Ed Sanders 2013-05-20 11:17:13 UTC
Another major issue is that window.getSelection (which is what we use to get cursor position, via Rangy) calculates using number of codepoints, not characters, so this would need to be translated.
Comment 4 Ed Sanders 2013-05-20 11:29:55 UTC
(In reply to comment #3)
> Another major issue is that window.getSelection (which is what we use to get
> cursor position, via Rangy) calculates using number of codepoints, not
> characters, so this would need to be translated.

Which would be done in ve.ce.getOffsetFromTextNode
Comment 5 Gerrit Notification Bot 2013-05-22 18:40:49 UTC
Related URL: https://gerrit.wikimedia.org/r/64965 (Gerrit Change I8d936fb15d82f73cd45fac142c540a7950850d55)
Comment 6 D Chan 2013-05-23 11:38:57 UTC
To be precise about what should constitute a character, see Grapheme Cluster Boundary Rules from TR29 (http://unicode.org/reports/tr29/) . We probably want to implement these, except insofar as it differs from the browsers' implementations of what constitutes a character for cursoring purposes.

The important thing right now is to abstract the code away from assuming single-codepoint characters, so the structure is correct. Then we can get the splitting rules right in isolation.
Comment 7 James Forrester 2013-05-25 15:59:54 UTC
Ed's main block of code is merged and will go out with wmf5; keeping this open for follow-up, and removing milestone of tomorrow.
Comment 8 James Forrester 2013-06-06 15:15:22 UTC
*** Bug 49233 has been marked as a duplicate of this bug. ***
Comment 9 James Forrester 2013-06-06 15:26:22 UTC
Marking this as merged; follow-up in bug 49257.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links