Last modified: 2014-07-07 20:30:14 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T52122, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 50122 - Parsoid renders trailing newlines in tables as paragraphs, making ptwiki infoboxes huge
Parsoid renders trailing newlines in tables as paragraphs, making ptwiki info...
Status: VERIFIED FIXED
Product: Parsoid
Classification: Unclassified
General (Other open bugs)
unspecified
All All
: Normal normal
: ---
Assigned To: Parsoid Team
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-06-24 19:06 UTC by Helder
Modified: 2014-07-07 20:30 UTC (History)
9 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
http://s24.postimg.org/l4s9um8s5/userbox.jpg (65.20 KB, image/jpeg)
2013-06-24 19:08 UTC, Helder
Details

Description Helder 2013-06-24 19:06:20 UTC
As reported on
https://pt.wikipedia.org/wiki/Project:Editor_Visual/Coment%C3%A1rios?oldid=36203746#Infobox
there is too much space in the infobox when we open
https://pt.wikipedia.org/wiki/Albert_Einstein?veaction=edit
(tested on Google Chrome 28.0.1500.52 and Firefox 21.0, using Linux Mint 15).


Maybe this is a problema with a local template or CSS, but I was unable to figure that out by inpecting the HTML code...

The problem do not happen on enwiki:
https://en.wikipedia.org/wiki/Albert_Einstein?veaction=edit
Comment 2 James Forrester 2013-06-24 19:23:27 UTC
This is probably due to CE's tinkering with templates so that they float correctly; the scale of the change is I imagine a template-specific issue, but it would be good to confirm.
Comment 3 Bartosz Dziewoński 2013-06-24 19:29:56 UTC
Actually, this looks like VE is inserting an additional empty cell at the beginning of every infobox table row.
Comment 4 James Forrester 2013-06-24 19:35:40 UTC
(In reply to comment #3)
> Actually, this looks like VE is inserting an additional empty cell at the
> beginning of every infobox table row.

s/VE/Parsoid/g ?
Comment 5 Roan Kattouw 2014-07-01 00:26:50 UTC
This is now looking much better, but it's still not fixed. There's lots of whitespace added at the bottom, causing the infobox to be much much taller than it should be.

Moving to Parsoid: the table is also too big in http://parsoid-lb.eqiad.wikimedia.org/ptwiki/Albert_Einstein?oldid=39283789

This happens because, in the expansion of the template, lots and lots of newlines (more specifically, the pattern newline-newline-space or sometimes newline-newline-newline-space) are produced. You can see these in http://pt.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvexpandtemplates&format=yamlfm&titles=Albert_Einstein (search for "\n\n\n \n\n"). This seems to be due to templates like https://pt.wikipedia.org/w/index.php?title=Predefini%C3%A7%C3%A3o:Info&action=edit with lots of if statements (in the 200s) that leak newlines.

The result is a table that looks like:
{|
|-
| Real || rows || here
|-
....
|-
| Last || row || here
|-
[lots of newlines]
|}

The wikitext parser ignores these newlines, but Parsoid converts them all to paragraphs inside of a row with a single cell. In this case, this row contains about 232 paragraphs and has a total height of 5615px.

I haven't been able to reproduce this locally though. Parsoid seems to quash the newlines just fine, even if I try to generate them with {{echo}}.
Comment 6 ssastry 2014-07-02 22:47:21 UTC
(In reply to Roan Kattouw from comment #5)
> I haven't been able to reproduce this locally though. Parsoid seems to quash
> the newlines just fine, even if I try to generate them with {{echo}}.

This might be some weird tokenizer thing ... after a lot of trial-and-error, I have a reduced test case where the tokenizer output subtly changes beyond a threshold of newlines ... I'll continue on this later, but with a wikitext that is 136 lines long (including table closing, ending, and other tags), the tokenizer just generates a tr-tag without a td-tag => no p-br mess. At 138 lines long, the tokenizer seems to generate a td-tag after which then induces p-br wrapping since all the newlines are considered "content".

https://gist.githubusercontent.com/subbuss/32c0fe97734381e1f026/raw/58c2815289a24a42753da14085518696bddb57e9/gistfile1.txt has the p-br .. now try removing 2 of the empty lines at the beginning or end. and boom, the p-brs disappear.

To be continued ... running out now ...
Comment 7 C. Scott Ananian 2014-07-02 22:49:26 UTC
Perhaps we're hitting a limitation in our lookahead and/or backtracking?
Comment 8 ssastry 2014-07-03 04:27:05 UTC
Line 224 of mediawiki.tokenizer.utils.js (which is a heuristic that looks upto 200 chars ahead to determine if we have non-whitespace content) is the problem. If the table has more than 200 chars whitespace, this effectively means that the tokenizer assumes that valid content follows and inserts a <td> there. 

In this context, another relevant consideration is Parsoid's implicit-td-insertion behavior (which is what is kicking in above). I don't know why we have this behavior. See my comments around line 5500 of https://gerrit.wikimedia.org/r/#/c/133957/2/tests/parser/parserTests.txt (which is the WIP patch to add tidy support to parser tests).

So, two things to check here:
(1) why do we have implicit-td-insertion behavior?
(2) what is the best way to fix the 200-char lookahead issue in the inline_breaks test in the tokenizer?

Fixing either one will fix the problem for this test case.
Comment 9 Gerrit Notification Bot 2014-07-03 21:05:32 UTC
Change 144061 had a related patch set uploaded by Subramanya Sastry:
(Bug 50122) Removed fragile lookahead in inline_breaks + improve perf.

https://gerrit.wikimedia.org/r/144061
Comment 10 Gerrit Notification Bot 2014-07-03 23:19:58 UTC
Change 144061 merged by jenkins-bot:
(Bug 50122) Removed fragile lookahead in inline_breaks + improve perf.

https://gerrit.wikimedia.org/r/144061
Comment 11 ssastry 2014-07-07 20:06:11 UTC
This fix is now deployed. The implicit-td-insertion behavior will be investigated separately as part of tidy support work.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links