Last modified: 2013-10-21 02:32:30 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T44803, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 42803 - Apply tag minimization on incoming HTML (ex: merge identical adjacent i/b tags)
Apply tag minimization on incoming HTML (ex: merge identical adjacent i/b tags)
Status: RESOLVED FIXED
Product: Parsoid
Classification: Unclassified
DOM (Other open bugs)
unspecified
All All
: High normal
: ---
Assigned To: ssastry
:
: 48194 52876 53208 (view as bug list)
Depends on:
Blocks: 48194
  Show dependency treegraph
 
Reported: 2012-12-06 23:28 UTC by ssastry
Modified: 2013-10-21 02:32 UTC (History)
12 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description ssastry 2012-12-06 23:28:24 UTC
The normalize_document algo that rewrites dom to deal with nested inline tags does not work in the presence of mw:StartTag and mw:EndTag meta-tags.  The algo needs fixup to handle that.  Plus, the reordered nodes also need their tsr and other attrs. patched up.

longest_linear_path function is one that checks for children.length > 1 -- and this check is not valid anymore.  Look for other such checks and update.
Comment 1 C. Scott Ananian 2013-04-09 21:49:52 UTC
CSA notes: see http://www.mediawiki.org/wiki/Parsoid/PostProcessor:DOM_Tag_Minimization

Also -- we might need to handle <a> tags in this algorithm, to ensure that <a> doesn't get broken up by reordering the other tags.  In particular, look at the wikitext snippet:

''Something [http://example.com mixed''''', even bold]'''

which should not break up the <a> tag.
Comment 2 Gerrit Notification Bot 2013-04-24 01:11:18 UTC
Related URL: https://gerrit.wikimedia.org/r/60617 (Gerrit Change I25711292536b66d33c8228ca01c526ccfa93875b)
Comment 3 C. Scott Ananian 2013-04-24 01:12:46 UTC
The tag minimization is running out of stack space.  The algorithm needs to be rewritten with an external stack.

See https://gerrit.wikimedia.org/r/60617
Comment 4 Gabriel Wicke 2013-05-07 23:45:40 UTC
Disabling this (predictably) breaks our serializer handling of adjacent i/b tags, which was reported in bug 48194. We should write an iterative version of longest_linear_path.
Comment 5 ssastry 2013-05-08 02:47:30 UTC
Why does disabling the DOM pass (on wt2html) break our serializer handling of i/b tags on (html2wt)?
Comment 6 Gabriel Wicke 2013-05-08 03:37:09 UTC
Something like <i>foo</i><i>bar</i> *could* be serialized as ''foo''<nowiki/>''bar'', but that is not what we are currently doing (and nor is it very desirable IMO).

Minimization avoids this issue by merging adjacent italic/bold ranges. With it enabled, the result will be ''foobar''.
Comment 7 ssastry 2013-05-08 04:05:59 UTC
But, <i>foo</i><<i>bar</i> could be fresh content and not be from original source.  Are you suggesting we run this on the dom we get from VE before serialization?
Comment 8 Gabriel Wicke 2013-05-08 04:08:36 UTC
Yes, that was the original idea. Minimization on the way out is still nice for round-tripping (and HTML sanity), but on the way in it is actually important for serialization.
Comment 10 ssastry 2013-05-08 04:13:02 UTC
For on the way in, meta tags are not an issue since.  So, it can already be applied if we fix the stack space issue.

> var dpp = require('./mediawiki.DOMPostProcessor.js'), d = require('./domino.js')
undefined
> var doc = d.createDocument('<i>foo</i><i>bar</i>')
undefined
> dpp.normalizeDocument(doc)
undefined
> doc.outerHTML
'<html><head></head><body><i>foobar</i></body></html>'
Comment 11 ssastry 2013-05-08 04:20:14 UTC
Anyway, but yes, that code is quite stale and requires upgrade and testing and can be enabled on the DOM on the way in.  Not sure if scott is still interested in this ticket.  Otherwise, I can tackle this one of these days.
Comment 12 ssastry 2013-05-08 04:21:07 UTC
Commit 1c80e2d7f06403a49c80abfbd69b8b1fad039786 has some test DOMs.
Comment 13 Gabriel Wicke 2013-06-12 14:49:35 UTC
*** Bug 49478 has been marked as a duplicate of this bug. ***
Comment 14 Gabriel Wicke 2013-06-18 21:02:37 UTC
*** Bug 49766 has been marked as a duplicate of this bug. ***
Comment 15 Ed Sanders 2013-06-21 11:26:29 UTC
*** Bug 49873 has been marked as a duplicate of this bug. ***
Comment 16 Gerrit Notification Bot 2013-06-21 11:28:39 UTC
Related URL: https://gerrit.wikimedia.org/r/69852 (Gerrit Change Ibb18b4f653c664e8ab7876498dc8395d878f7aaa)
Comment 17 C. Scott Ananian 2013-07-12 17:09:28 UTC
Just recapping an earlier IRC discussion: my opinion is that the general form of this bug is too aggressive (globally rewrites the input too drastically) and unnecessary.  I prefer to see local fixes such as the one in comment 16 that address the straightforward local cases (like merging adjacent <span> or other tags with identical attributes) rather than trying to find globally-optimal tag nestings (which may create surprising long-distance mutations which can also be costly to search for).

My opinion may well be wrong.  Hopefully the bugs linked as dups can be evaluated in this light; one of them may well be the killer case which simply can't be solved with a local transform and instead needs the resurrection of the global tag minimization code.  I look forward to being thus shown incorrect. ;)
Comment 18 Gabriel Wicke 2013-07-12 19:28:42 UTC
(In reply to comment #17)
> Just recapping an earlier IRC discussion: my opinion is that the general form
> of this bug is too aggressive (globally rewrites the input too drastically)
> and
> unnecessary.

The incoming normalization will only apply to ranges where at least a part is new or modified content. So no global rewrites.

A proper solution should not be thrown off by other unrelated formatting elements, so only looking for siblings won't quite cut it. This is easier in VE as they are using flat annotations instead of elements (see comment 16). Our equivalent needs to flatten nested elements into ranges, which is basically what this algo already implements.
Comment 19 Ed Sanders 2013-07-13 10:44:43 UTC
With this change: https://gerrit.wikimedia.org/r/#/c/73523/ it is now possible that VE will send back adjacent annotations, e.g. by loading a document with two bolds separated by text, and then deleting the text between them, we will send back <b>Foo</b><b>Bar</b>, which currently serialises to '''Foo''''''Bar''' (which goes back to <b>Foo'<i>'</i>Foo</b>) so there is potential for corruption.
Comment 20 ssastry 2013-07-15 03:56:38 UTC
Given all the discussion we've had, I think we should implement a quick and simple solution for now.  Here is my proposal.

As far as I can tell, the only place where tag reordering and minimization matters is where wikitext I and B tags end up as siblings.  So, just targeting that situation should give what we want.

1. ..</i><i> ..
2. ..</b><b> ..
3. ..</i></b><i> ..
4. ..</b></i><b> ..

These 4 scenarios boil down to two cases which can be handled fairly easily in the serializer.

While HTML like this <s><i><b>a</b></i></s><b>c</b> (and other variants) could also be minimized (and would indeed be handled by the algorithm under discussion that I implemented), serializing this to <s>'''''a'''''</s>'''c'''  should be sufficient since there is no ambiguity in the parse in the front end and hence no corruption.
Comment 21 Gabriel Wicke 2013-08-16 00:23:23 UTC
*** Bug 52876 has been marked as a duplicate of this bug. ***
Comment 22 Gabriel Wicke 2013-08-16 00:32:12 UTC
(In reply to comment #20)
> Given all the discussion we've had, I think we should implement a quick and
> simple solution for now.  Here is my proposal.
> 
> As far as I can tell, the only place where tag reordering and minimization
> matters is where wikitext I and B tags end up as siblings.  So, just
> targeting
> that situation should give what we want.
> 
> 1. ..</i><i> ..
> 2. ..</b><b> ..
> 3. ..</i></b><i> ..
> 4. ..</b></i><b> ..
> 
> These 4 scenarios boil down to two cases which can be handled fairly easily
> in
> the serializer.

We also need to consider variants where other tags are in the mix. Example:

<i><ins><b>bold-italic</b></ins></i><b>bold</b>

This happens to be a case that is handled by our minimization algorithm. The main deficiency is that the current recursive implementation can run out of stack space: https://gerrit.wikimedia.org/r/#/c/60617/

The other issue is that it currently applies to both old and new content. For incoming content it should only apply to b/i pairs in new/modified content.

I still think the simplest solution is to fix the algorithm to be iterative, and to enable it only on new / modified incoming content on the way in.
Comment 23 ssastry 2013-08-16 14:50:50 UTC
I think for that example: ''<ins>'''bold-italic'''</ins>'''''bold''' is an acceptable serialization which parses back to the same HTML.

But, the minimized HTML <b><i><ins>bold-italic</ins></i>bold</b> can be serialized to '''''<ins>bold-italic></ins>''bold'''.  While shorter, this does not parse back to the original HTML, but to <i><b><ins>bold-italic</ins></b></i><b>bold</b>.

I am no longer convinced that we need a generic minimization solution rather than a special-case solution just for b-i tags.  This will just let us strip out a big part of our codebase.

So, I am going to tackle this bug in the next couple days and first attempt the simplified solution and fall back to fixing the existing algorithm otherwise.
Comment 24 ssastry 2013-08-16 14:55:47 UTC
That said, a small part of me will be sad at not using it, because that algorithm was pretty much one of my first commits to the codebase ;-) and I think it is a neat general-purpose algorithm for rewriting interleaved tags to a minimal form.
Comment 25 Gabriel Wicke 2013-08-16 16:05:01 UTC
(In reply to comment #23)
> I think for that example: ''<ins>'''bold-italic'''</ins>'''''bold''' is an
> acceptable serialization which parses back to the same HTML.
> 
> But, the minimized HTML <b><i><ins>bold-italic</ins></i>bold</b> can be
> serialized to '''''<ins>bold-italic></ins>''bold'''.  While shorter, this
> does
> not parse back to the original HTML, but to
> <i><b><ins>bold-italic</ins></b></i><b>bold</b>.

Which then serializes to 

'''''<ins>bold-italic</ins>''bold'''

This is semantically identical, and round-tripping is no issue for new or modified content. Do you see any issue in simply doing this apart from the need to fix up the recursion issue?
Comment 26 ssastry 2013-08-16 16:27:46 UTC
That (I think that function can easily be converted to tail-recursive form, if it is not already, and then iteration) + feels like using a very heavyweight technique for a relatively simple problem + deleting code is almost always a good thing.  It would be more of a no-brainer if there were other scenarios that required this generic solution.
Comment 27 Gerrit Notification Bot 2013-08-22 04:45:30 UTC
Change 80335 had a related patch set uploaded by Subramanya Sastry:
(Bug 42803) Apply I/B minimization to incoming HTML

https://gerrit.wikimedia.org/r/80335
Comment 28 ssastry 2013-08-22 22:06:26 UTC
*** Bug 48194 has been marked as a duplicate of this bug. ***
Comment 29 Gerrit Notification Bot 2013-08-23 23:55:31 UTC
Change 80335 merged by jenkins-bot:
(Bug 42803) Apply I/B minimization to incoming HTML

https://gerrit.wikimedia.org/r/80335
Comment 30 James Forrester 2013-09-25 01:29:56 UTC
*** Bug 53208 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links