Last modified: 2014-07-08 17:57:47 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T58809, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 56809 - Newlines after </li> cause dirty diffs with Parsoid
Newlines after </li> cause dirty diffs with Parsoid
Status: RESOLVED FIXED
Product: Parsoid
Classification: Unclassified
General (Other open bugs)
unspecified
All All
: High normal
: ---
Assigned To: Gabriel Wicke
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-11-08 21:56 UTC by C. Scott Ananian
Modified: 2014-07-08 17:57 UTC (History)
6 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description C. Scott Ananian 2013-11-08 21:56:57 UTC
We would like to revert the fix for bug 39617.  That bug should have been fixed via the 'whitespace' CSS property.

The 'fix' to bug 39617 caused Parsoid to begin creating dirty diffs (and failing its test suite) -- see https://gerrit.wikimedia.org/r/92978
Comment 1 Gerrit Notification Bot 2013-11-08 22:20:14 UTC
Change 94443 had a related patch set uploaded by Cscott:
Revert "Have list items occupy their own line"

https://gerrit.wikimedia.org/r/94443
Comment 3 Bartosz Dziewoński 2013-11-08 23:17:48 UTC
Why does this cause dirty diffs? Can I get a link to such diff?
Comment 4 ssastry 2013-11-08 23:32:15 UTC
Matma(In reply to comment #3)
> Why does this cause dirty diffs? Can I get a link to such diff?

See the gerrit patch cscott linked to in the bug description (https://bugzilla.wikimedia.org/show_bug.cgi?id=56809#c0)
Comment 5 Bartosz Dziewoński 2013-11-08 23:35:41 UTC
I've already seen the patch and sadly it is not telling me anything. What does a "dirty diff" mean here? A dirty diff during after editing a page with VisualEditor? Because that's the only significant definition I know.
Comment 6 ssastry 2013-11-08 23:37:05 UTC
Also, see https://gerrit.wikimedia.org/r/#/c/93626/ where I started messing with Parsoid to prevent these dirty diffs, but it seems a lot more pain than worth it currently -- it is mostly a distraction from other critical issues that I just left it in WIP stage. This wont be an issue in production necessarily right now, but if the CSS fix is not employed, then Parsoid will also have to emit the same kind of HTML that the core php parser emits and then this could be an issue in production with occasional dirty diffs without this WIP patch being worked out to completion. So, seems like a lot of pain for unclear benefit.
Comment 7 ssastry 2013-11-08 23:39:42 UTC
(In reply to comment #5)
> I've already seen the patch and sadly it is not telling me anything. What
> does
> a "dirty diff" mean here? A dirty diff during after editing a page with
> VisualEditor? Because that's the only significant definition I know.

Parsoid is failing the parser tests in html2wt and html2html mode since the same parser tests are used by core and Parsoid. I blacklisted the failures for now. But this blacklisting is not ideal since we have to be a bit more careful about failures in the future when inspecting changes.

That aside, if VE and Parsoid emits the same HTML that the core parser does for lists (emitting newlines around <ul> tags), then we have to deal with it.
Comment 8 Erwin Dokter 2013-11-09 10:15:05 UTC
Oh for crying...! Please provide a COMPLETE working set of CSS for horizontal lists that supposedly fixes all the bugs I have spent TWO YEARS trying to fix! I have tested EVERY scenario, including inline-block, which breaks nested lists in *ALL* browsers.

Unless you come up with a working CSS solution (which I doubt you will), I insist my patch be reinstated and Parsoid be fixed instead.
Comment 9 Bartosz Dziewoński 2013-11-09 12:27:30 UTC
(In reply to comment #8)
> Unless you come up with a working CSS solution (which I doubt you will), I
> insist my patch be reinstated and Parsoid be fixed instead.

(Just a note that the patch was not reverted, but a revert has been submitted for review / consideration.)

Is there a demo of hlist set up somewhere? Or a list of things it does, and a list of cases it supports? Any of these things would be very useful here.
Comment 10 Erwin Dokter 2013-11-09 13:39:34 UTC
https://test.wikipedia.org/wiki/MediaWiki:Common.css contain current CSS for hlist.
https://test.wikipedia.org/wiki/User:Edokter/sandbox contains the test cases.

Any suggested changes will break one or more testcases.
Comment 11 Erwin Dokter 2013-11-09 13:42:26 UTC
(In reply to comment bug:39617#27)
> The Tidy bug is what caused hlists to work in the first place, according to
> what you're saying :)

Why is that even considered a bug? list items are block level elements, and as such, should be on their own line per unwritten code style conventions.
Comment 12 Bartosz Dziewoński 2013-11-09 13:51:56 UTC
(In reply to comment #11)
> (In reply to bug 39617 comment #27)
> > The Tidy bug is what caused hlists to work in the first place, according to
> > what you're saying :)
> 
> Why is that even considered a bug? list items are block level elements, and
> as such, should be on their own line per unwritten code style conventions.

Tidy's overzealous newline insertion has caused problems in the past already, see bug 260 (which is still open only because of the same Tidy issue) and bug 38800 (which is basically what Gabriel reported).
Comment 13 C. Scott Ananian 2013-11-09 14:25:29 UTC
Just to guide the discussion, let's try to keep this bug concentrated on the parsertest regressions in Parsoid and other extensions, and possible fixes for that.  Discussion of the hlist issue and the CSS fix (or lack thereof) goes over on bug 39617.

Since most of the discussion has already happened on bug 39617, I'd say that, if in doubt, let's continue the discussion there.  I'm just trying to prevent the discussion from being scattered across gerrit and a bunch of different bugzilla items, and people's good points being lost in the shuffle. (In particular, I'm going to copy comment 10 over to bug 39617.)
Comment 14 Erwin Dokter 2013-11-09 14:26:52 UTC
That is a scoping problem; Tidy should simply not touch anything that comes out of GeSHi (pre & code). This is simply two external extentions clashing.
Comment 15 Erwin Dokter 2014-02-20 11:52:25 UTC
I also have a problem understanding 'dirty diffs'. Where does parsoid get its HTML from? I thought it converted wikitext to HTML and then back. So how does the parser even come into play? Is Tidy also involved? Sorry to ask al this, but I simply need a basic understanding of parsoid.

FYI, the nowrap has been completely removed from hlist, so there are no CSS hacks to deal with. But any locally applied nowrap may cause errors, so the linebreaks must remain.

I can't help but feel that this issue is a major blocker for any parser improvement in the future, so I want to understand exactly where the problem exists and come up with a solution that will prevent such issues in the future.
Comment 16 ssastry 2014-02-20 20:39:06 UTC
Parsoid converts wikitext to HTML and back. Tidy does not enter the picture at all in the Parsoid pipeline.

There are two separate issues:

1. Parsoid uses the same set of PHP parser tests to make sure that Parsoid generates HTML that doesn't break existing wikitext. It uses PHP HTML to serialize back to wikitext. The extra newline that PHP parser emits introduces extra newlines in wikitext. "a\n\n*b" instead of "a\n*b". So, the extra newline is a "dirty diff". This now obscures our regular testing because we blacklist these failing tests and then try to compare changes in HTML output. So, not entirely unworkable, but requires greater diligence to monitor regressions during development. Ideally, we wouldn't have these failures. Alternatively, we have to fix Parsoid's HTML to wikitext conversion to ignore the extra newlines when they come from the parser's insertion rather than from source -- that means extra bookkeeping.

2. If you want Parsoid HTML to have the same CSS semantics like PHP HTML, Parsoid would also have to start adding these newlines after <ul> and before </ul>. These extra newline characters would now have to be accounted for in the DSR calculations (DSR maps a range to wikitext source to a DOM subtree, ex: wikitext substring [3,9] generate <p>foobar</p> HTML) which is also additional bookkeeping. DSR calculations has been carefully tweaked to account for every single character of wikitext so that Parsoid can faithfully output original wikitext on unedited portions of the HTML. So, Parsoid would have to ignore the extra inserted newlines in its calculations. Without accounting for them, Parsoid will generate different wikitext on conversion which manifests as "dirty diffs". Again, all of this extra bookkeeping can be done.

But, if we can avoid the extra newlines, the extra work can be avoided.
Comment 17 ssastry 2014-02-20 20:42:53 UTC
(In reply to Erwin Dokter from comment #15)

> I can't help but feel that this issue is a major blocker for any parser
> improvement in the future, so I want to understand exactly where the problem
> exists and come up with a solution that will prevent such issues in the
> future.

This is not a general blocker. Parser changes can be done and have been done, but changes have to be considered carefully since there are a lot of dependent users, and that is what you are seeing here in this bug report discussion.
Comment 18 Erwin Dokter 2014-02-20 21:01:32 UTC
Re: "a\n\n*b"

I don't understand where the extra newline comes from, as the parser outputs only one. Regardless... perhaps we need to go about this another way.

There should be a simple set of rules that govern how parser and parsoid interact, for example:

* block level elements (including list itema) must be terminated with e newline
* all other (inline) elements should not

Would that ease the situation?
Comment 19 Gabriel Wicke 2014-02-20 22:44:24 UTC
(In reply to Erwin Dokter from comment #18)
> * block level elements (including list itema) must be terminated with e
> newline
> * all other (inline) elements should not
> 
> Would that ease the situation?

This would in fact complicate the situation. We don't insert newlines in HTML where there were none in the original wikitext. This preserves the semantics independent of the CSS whitespace styling (remember that 'pre' is a possible value) and helps us to round-trip wikitext correctly without introducing spurious newlines.

As discussed in bug 39617, the newlines don't seem to be necessary any more now that the nowrap requirement is gone. Taking into account the cost they have for Parsoid development, I'd recommend removing the code that inserts them from the PHP parser.
Comment 20 Erwin Dokter 2014-02-20 23:12:20 UTC
OK, this discussion is not going to end in the next couple of hundred years, partly because I will never understand what is really going on. I still don't know where parsoid is getting its HTML from. Can't be from the browser DOM; that has gone through Tidy. 

Revert the patch.

I just want the parser (as long as it lasts) to emit some sanely formatted lists, just as it outputs sanely formatted tables. Is that too much to ask?

If I can make the parser to make a clean break *only* between </li> and <li>, without any linebreak before or after <ul> and </ul>, would that make parsoid happy?
Comment 21 ssastry 2014-02-20 23:42:09 UTC
(In reply to Erwin Dokter from comment #20)
> OK, this discussion is not going to end in the next couple of hundred years,
> partly because I will never understand what is really going on. 

Do you want to pop onto IRC (#mediawiki-parsoid) sometime and chat with us about this? We can then post the summary here after that. It might be simpler to figure this out in a more interactive discussion than is possible via bugzilla comments. I am going to be signing off for the day shortly, but the IRC channel is usually active between 9 am PST - 4 pm PST when most of us are around.

> I still
> don't know where parsoid is getting its HTML from. Can't be from the browser
> DOM; that has gone through Tidy. 

No, Parsoid generates the HTML as I indicated in Comment 16. In the case of Visual Editor use, VE then sends that HTML (alongwith any user edits) to Parsoid to convert back to wikitext.

Since Parsoid HTML will replace PHP parser HTML at some point in the foreseeable future (this year possibly), if anything requires these newlines to be present for functionality, Parsoid will also have to add them and that is what we are complaining about so far.
Comment 22 Erwin Dokter 2014-02-20 23:51:41 UTC
Ah, so it is only the *tests* that are affected. I think I understand now.

Still my question remains: can I make the parser make a clean break *only* between </li> and <li> (as that is the core issue affecting nowrap), without any linebreak before or after <ul> and </ul>?
Comment 23 ssastry 2014-02-21 06:13:07 UTC
(In reply to Erwin Dokter from comment #22)
> Ah, so it is only the *tests* that are affected. I think I understand now.

Not just tests. See part (2) in Comment 16.

> Still my question remains: can I make the parser make a clean break *only*
> between </li> and <li> (as that is the core issue affecting nowrap), without
> any linebreak before or after <ul> and </ul>?

That is already the case since wikitext has newlines between list items. See below for Parsoid's output.

[subbu@earth tests] echo "*a\n*b\n*c" | node parse
<body data-parsoid='{"dsr":[0,9,0,0]}'><ul data-parsoid='{"dsr":[0,8,0,0]}'><li data-parsoid='{"dsr":[0,2,1,0]}'>a</li>
<li data-parsoid='{"dsr":[3,5,1,0]}'>b</li>
<li data-parsoid='{"dsr":[6,8,1,0]}'>c</li></ul>
</body>
Comment 24 Erwin Dokter 2014-02-21 15:08:44 UTC
Sorry, but it isn't. The *old* PHP parser output is:

<ul><li>a
</li><li>b
</li><li>c
</li></ul>

No linebreak *between* list items. Parsoid goes (if I understand above correctly):

<ul><li>a</li>
<li>b</li>
<li>c</li></ul>

I want the PHP parser to behave the smae.
Comment 25 Gabriel Wicke 2014-02-21 23:48:56 UTC
Parsoid does not insert newlines where there are none:

<ul><li>foo</li><li>bar</li></ul>

in wikitext becomes 

<ul><li>foo</li><li>bar</li></ul>

It does preserve newlines from wikitext though:

* foo 
* bar
* baz

in wikitext becomes

<ul><li> foo</li>
<li> bar</li>
<li> baz</li></ul>

PHP + tidy inserts newlines in the first case too, which is broken.
Comment 26 Bartosz Dziewoński 2014-02-22 00:08:07 UTC
(In reply to Gabriel Wicke from comment #25)
> PHP + tidy inserts newlines in the first case too, which is broken.

I'm pretty sure that this is the case only if Tidy is enabled, and that kind of behavior is one of the many reasons why I want to kill it. (I really do not know why no one from the Parsoid team shares my sentiment about this.)
Comment 27 Gerrit Notification Bot 2014-05-20 16:22:16 UTC
Change 134380 had a related patch set uploaded by GWicke:
Update list item newline handling to follow Parsoid's model

https://gerrit.wikimedia.org/r/134380
Comment 28 Gabriel Wicke 2014-05-20 16:55:41 UTC
(In reply to Bartosz Dziewoński from comment #26)
> (In reply to Gabriel Wicke from comment #25)
> > PHP + tidy inserts newlines in the first case too, which is broken.
> 
> I'm pretty sure that this is the case only if Tidy is enabled, and that kind
> of behavior is one of the many reasons why I want to kill it. (I really do
> not know why no one from the Parsoid team shares my sentiment about this.)

Believe me, we want to kill it just as much as you do. And we have been working hard on it for 2+ years now.
Comment 29 Gerrit Notification Bot 2014-06-09 18:13:17 UTC
Change 134380 merged by jenkins-bot:
Update list item newline handling to follow Parsoid's model

https://gerrit.wikimedia.org/r/134380
Comment 30 Gerrit Notification Bot 2014-06-13 17:18:56 UTC
Change 94443 abandoned by Cscott:
Revert "Have list items occupy their own line".

Reason:
Abandoned in favor of Ib7aa9449bbd994cb23b83b3f23cff944b1cddadf

https://gerrit.wikimedia.org/r/94443
Comment 31 ssastry 2014-07-08 17:57:47 UTC
This has since been fixed in core.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links