Last modified: 2013-07-04 10:33:45 UTC
Test case:

echo -e '<FONT COLOR="#000000">a</FONT><FONT COLOR="#FF3300">b</FONT>' | nodejs parse --wt2wt

Output:

<font COLOR="#000000">a<font COLOR="#FF3300">b
This seems to be a bug (?) in the HTML5 tree builder. It seems to delete upper-case tags and then implicitly close the unmatched tags, which is what introduces the bug reported here. https://gist.github.com/51b7a850b2b5cb3f3579 demonstrates this. We can fix this by sending the HTML5 tree builder lower-case tag names, but that will introduce dirty diffs on upper vs. lower case. Then again, we are already introducing those diffs right now since we are not tracking the source case, so it may not be a big deal if we rely on selective serialization to deal with this.
Pasting the contents of the gist inline here in case that goes away at a later point.

[subbu@earth tests] echo "<b>foo</b><i>bar</i>" | node parse.js --trace html
---- <chunk> ----
T:html: {"type":"TagTk","name":"p","attribs":[],"dataAttribs":{}}
T:html: {"type":"TagTk","name":"b","attribs":[],"dataAttribs":{"tsr":[0,3],"stx":"html"}}
T:html: "foo"
T:html: {"type":"EndTagTk","name":"b","attribs":[],"dataAttribs":{"tsr":[6,10],"stx":"html"}}
T:html: {"type":"TagTk","name":"i","attribs":[],"dataAttribs":{"tsr":[10,13],"stx":"html"}}
T:html: "bar"
T:html: {"type":"EndTagTk","name":"i","attribs":[],"dataAttribs":{"tsr":[16,20],"stx":"html"}}
T:html: {"type":"EndTagTk","name":"p","attribs":[],"dataAttribs":{}}
T:html: {"type":"NlTk","dataAttribs":{}}
T:html: {"type":"EOFTk"}
---- </chunk> ----
<p data-parsoid="{"dsr":[0,20]}"><b data-parsoid="{"tsr":[0,3],"stx":"html","dsr":[0,10]}">foo</b><i data-parsoid="{"tsr":[10,13],"stx":"html","dsr":[10,20]}">bar</i></p>

[subbu@earth tests] echo "<B>foo</B><I>bar</I>" | node parse.js --trace html
---- <chunk> ----
T:html: {"type":"TagTk","name":"p","attribs":[],"dataAttribs":{}}
T:html: {"type":"TagTk","name":"B","attribs":[],"dataAttribs":{"tsr":[0,3],"stx":"html"}}
T:html: "foo"
T:html: {"type":"EndTagTk","name":"B","attribs":[],"dataAttribs":{"tsr":[6,10],"stx":"html"}}
T:html: {"type":"TagTk","name":"I","attribs":[],"dataAttribs":{"tsr":[10,13],"stx":"html"}}
T:html: "bar"
T:html: {"type":"EndTagTk","name":"I","attribs":[],"dataAttribs":{"tsr":[16,20],"stx":"html"}}
T:html: {"type":"EndTagTk","name":"p","attribs":[],"dataAttribs":{}}
T:html: {"type":"NlTk","dataAttribs":{}}
T:html: {"type":"EOFTk"}
---- </chunk> ----
<p data-parsoid="{"dsr":[0,20]}"><b data-parsoid="{"tsr":[0,3],"stx":"html","autoInsertedEnd":true,"dsr":[0,20]}">foo<i data-parsoid="{"tsr":[10,13],"stx":"html","autoInsertedEnd":true,"dsr":[10,null]}">bar</i></b></p>
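For illustration, the workaround discussed above (handing the tree builder lower-case tag names) could look roughly like the sketch below. It is only a sketch under assumptions, not necessarily what the actual fix does: it assumes tokens shaped like the ones in the `--trace html` output, and both the `lowerCaseTagNames` helper and the `srcTagName` field are hypothetical names made up here. Recording the original spelling is one possible way to let a serializer restore the source case later.

```javascript
// Hypothetical helper, not Parsoid's actual code: normalize tag-token names
// to lower case before they reach the HTML5 tree builder.
const TAG_TOKEN_TYPES = new Set(['TagTk', 'EndTagTk']);

function lowerCaseTagNames(tokens) {
  return tokens.map((tok) => {
    // Text (plain strings) and non-tag tokens such as NlTk/EOFTk pass through.
    if (typeof tok !== 'object' || tok === null || !TAG_TOKEN_TYPES.has(tok.type)) {
      return tok;
    }
    const lower = tok.name.toLowerCase();
    if (lower === tok.name) {
      return tok; // already lower case, nothing to record
    }
    // Copy the token instead of mutating it; `srcTagName` is an assumed field
    // that keeps the original spelling around for serialization.
    return {
      ...tok,
      name: lower,
      dataAttribs: { ...tok.dataAttribs, srcTagName: tok.name },
    };
  });
}

// Tokens modeled on the upper-case trace above.
const tokens = [
  { type: 'TagTk', name: 'B', attribs: [], dataAttribs: { tsr: [0, 3], stx: 'html' } },
  'foo',
  { type: 'EndTagTk', name: 'B', attribs: [], dataAttribs: { tsr: [6, 10], stx: 'html' } },
  { type: 'EOFTk' },
];

console.log(JSON.stringify(lowerCaseTagNames(tokens), null, 2));
```

With the names normalized this way, the end tags should match their start tags again, so the tree builder would no longer need to auto-insert end tags for the upper-case input.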
Fixed in https://gerrit.wikimedia.org/r/34364
[Parsoid component reorg by merging JS/General and General. See bug 50685 for more information. Filter bugmail on this comment. parsoidreorg20130704]