Last modified: 2013-12-03 23:08:19 UTC
http://parsoid.wmflabs.org/en/User:Adjwilley/sandbox10?oldid=562331693 Relevant wikitext: *{{Cite book| last=Fádil-i-Mázindarání | first= Asadu'lláh | authorlink = Mírzá Asadu’llah Fádil Mázandarání | year=1967 | title=Asráu'l-Á<u>th<u>ár, Vol.I| pages= 453 | publisher=Bahá'í Publishing Trust, Tehran | url = http://www.h-net.org/~bahai/areprint/authors/mazandarani/asrar.htm | ref=harv}} (note the <u> isn't opened then closed, but double-opened) Output HTML (with data-*, about and a few others stripped): <li><span typeof="mw:Transclusion"><a rel="mw:WikiLink" href="...">Fádil-i-Mázindarání, Asadu'lláh</a> (1967). <a rel="mw:ExtLink" href="..."><i>Asráu'l-Á<u>th<u>ár, Vol.I</u></u></i></a><u><u>. Bahá'í Publishing Trust, Tehran. p.<span typeof="mw:Entity"> </span>453.</u></u></span><u><u><span class="Z3988"><span style="display:none;"><span typeof="mw:Entity"> </span></span></span></u></u></li><u><u> <li>(next item here, note how we're still inside of <u><u>)
Diff showing resulting VE breakage: https://en.wikipedia.org/w/index.php?title=User:Adjwilley/sandbox10&curid=38366631&diff=562462441&oldid=562331693 . I'm not sure if the categories being dropped is on the VE or Parsoid side yet.
Since this is a plain transclusion this is pretty much the behavior I would expect. The formatting is processed in accordance with the HTML spec. I suspect tidy has a heuristic that closes the <u> at some point in contrast with the HTML spec. Something that looks broken on the Parsoid side is the missing encapsulation for formatting that spans list items: echo '<ul><li>{{echo|<u>}}</li><li>foo</li></ul>' | node parse <body><ul> <li><u about="#mwt1" typeof="mw:Transclusion" data-mw='{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"<u>"}},"i":0}'></u></li> <li><u>foo</u></li> </ul> <u> </u></body> (reformatted the above a bit and removed data-parsoid for clarity) We should in theory be able to detect that the leaked <u>s were completely auto-inserted by the treebuilder. Maybe we can use that information to extend the encapsulation for cases like this one.
No, the tpl encapsulation looks correct to me. The tree builder correction in this case is quite broad and it would be hard to detect auto-correction as well. On this last example, there are no autoInserted* flags except on the innermost <u> tag. As long as we are using a html5 compliant tree builder, our options are a little limited on this kind of html without doing more complicated tracking/analyses. I am not convinced those are worth the trouble and headaches.
The template encapsulation does not catch all template-affected content, so is not correct really. Alternatively we should be able to undo the leak to match the PHP parser's semantics. That would remove the need for extending the encapsulation in this case.
Change 73369 had a related patch set uploaded by Subramanya Sastry: WIP: (Bug 50536) Improved handling of tree-builder fixup https://gerrit.wikimedia.org/r/73369
https://www.mediawiki.org/w/index.php?title=VisualEditor/Basic_example_worksheet&oldid=751458&veaction=edit "Foo <tt>bar<tt>\n == Heading == * Lists"
Re-titling to reflect our plan [1] to avoid the leaking and match PHP parser semantics instead., [1]: https://www.mediawiki.org/wiki/Parsoid/DOM_notes