Last modified: 2011-12-18 17:08:36 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T34753, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 32753 - action=parse does not detect headers in templates correctly
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: 1.18.x
Hardware: All All
Importance: Normal major
Target Milestone: ---
Assigned To: Nobody
Depends on:
Blocks:
Reported: 2011-12-01 17:46 UTC by Giftpflanze
Modified: 2011-12-18 17:08 UTC (History)

Description Giftpflanze 2011-12-01 17:46:41 UTC
Bug detected at: http://de.wikipedia.org/w/api.php?format=yamlfm&action=parse&page=BD:Label5&prop=section

Two third-level headers are embedded in a template call, and the parsed results are wrong:

From number=1.1 onward, every byteoffset is the end-of-page offset, and index and fromtitle are missing.

This may have the same cause as bug 25203, comment 3 (“The api isn't at fault here, its only displaying what the parser output says there is.”).
Comment 1 Brion Vibber 2011-12-01 18:54:14 UTC
Hmm...

{
	"warnings": {
		"parse": {
			"*": "Unrecognized value for parameter 'prop': section"
		}
	},
	"parse": {
		"title": "Benutzer Diskussion:Label5"
	}
}
Comment 2 Giftpflanze 2011-12-01 19:11:34 UTC
(In reply to comment #1)
Awww, somehow the s at the end of the URL got lost. Correct link: http://de.wikipedia.org/w/api.php?format=yamlfm&action=parse&page=BD:Label5&prop=sections
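
For anyone reproducing this from a script, here is a minimal Python sketch that builds the corrected query. `build_parse_url` is a hypothetical helper name, and `format=json` is an assumption made for machine readability (the links in this report use `yamlfm`).

```python
from urllib.parse import urlencode

API = "http://de.wikipedia.org/w/api.php"  # endpoint taken from this report

def build_parse_url(page, prop="sections", fmt="json"):
    """Hypothetical helper: build an action=parse query URL.

    The default is prop="sections"; prop="section" (without the s) is
    what triggered the "Unrecognized value" warning in comment 1.
    """
    params = {"format": fmt, "action": "parse", "page": page, "prop": prop}
    return API + "?" + urlencode(params)

print(build_parse_url("BD:Label5"))
```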
Comment 3 Brion Vibber 2011-12-01 21:46:52 UTC
Ok I can confirm your results there.

The first two sections (first one is 'regular', second is in the templated text):

			{
				"toclevel": 1,
				"level": "2",
				"line": "Gr\u00fc\u00df Gott und Herzlich Willkommen auf meiner Benutzer-Diskussionsseite",
				"number": "1",
				"index": "1",
				"fromtitle": "Benutzer_Diskussion:Label5",
				"byteoffset": 3417,
				"anchor": "Gr.C3.BC.C3.9F_Gott_und_Herzlich_Willkommen_auf_meiner_Benutzer-Diskussionsseite"
			},
			{
				"toclevel": 2,
				"level": "3",
				"line": "Meine WP-W\u00fcnsche f\u00fcr 2011",
				"number": "1.1",
				"index": "",
				"fromtitle": false,
				"byteoffset": 7897,
				"anchor": "Meine_WP-W.C3.BCnsche_f.C3.BCr_2011"
			},

Since this second one comes from within a template, the current parser can't really assign it a byte position within the article text. I'm not too familiar with how this output is generated so will have to take a peek to say more. Ideally it at least shouldn't mess up the later sections, but I'm not sure how a "byteoffset" helps if you don't have a "bytelength"... possibly this is just a bad data structure that's not really suitable for how sections are handled. :(
Comment 4 Giftpflanze 2011-12-02 14:34:24 UTC
(In reply to comment #3)
> Since this second one comes from within a template, the current parser can't
> really assign it a byte position within the article text. I'm not too familiar
> with how this output is generated so will have to take a peek to say more.
> Ideally it at least shouldn't mess up the later sections, but I'm not sure how
> a "byteoffset" helps if you don't have a "bytelength"... possibly this is just
> a bad data structure that's not really suitable for how sections are handled.
> :(

Why is it actually called byteoffset when it is a character offset and not a byte offset? I propose renaming it to charoffset, maybe. I understand that the parser has no notion of sections in templates, and I don't really mind that. What I care about are the byteoffsets, or rather where a section starts (and then, implicitly, where it ends), so that I can take the sections apart.
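
To illustrate what "taking them apart" means here, a rough Python sketch follows. It assumes the offsets index characters, as observed above, and skips entries without an integer offset. `split_at_offsets` is a hypothetical name, and the page text is made up.

```python
def split_at_offsets(text, sections):
    """Cut `text` at each section's start offset (hypothetical helper).

    Assumes offsets index characters, as observed in this report; for
    true byte offsets one would slice text.encode("utf-8") instead.
    """
    # Keep only entries with a usable integer offset, in page order.
    starts = sorted(s["byteoffset"] for s in sections
                    if isinstance(s.get("byteoffset"), int))
    # Each section ends where the next one starts; the last ends at EOF.
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends)]

# Made-up page text, not taken from the wiki.
text = "intro\n== Grüße ==\nfoo\n== B ==\nbar\n"
sections = [{"byteoffset": text.index("== Gr")},
            {"byteoffset": text.index("== B")}]
parts = split_at_offsets(text, sections)
```

With offsets like the ones reported here, where every entry from 1.1 on points at the end of the page, the later slices would come out empty and the preceding section would swallow the rest of the page.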
Comment 5 Giftpflanze 2011-12-02 16:16:35 UTC
This affects not only templates but also tables: Benutzer Diskussion:Caliban@dewiki. And <div> elements: Benutzer Diskussion:Elchbauer@dewiki. And parser functions: Benutzer Diskussion:4Frankie@dewiki.
Comment 6 DrTrigon 2011-12-18 17:08:36 UTC
Can confirm this bug on de:wiki 1.18mwf e.g. on
(In reply to comment #3)
> Ok I can confirm your results there.
> 
> The first two sections (first one is 'regular', second is in the templated
> text):
> 
>             {
>                 "toclevel": 1,
>                 "level": "2",
>                 "line": "Gr\u00fc\u00df Gott und Herzlich Willkommen auf meiner
> Benutzer-Diskussionsseite",
>                 "number": "1",
>                 "index": "1",
>                 "fromtitle": "Benutzer_Diskussion:Label5",
>                 "byteoffset": 3417,
>                 "anchor":
> "Gr.C3.BC.C3.9F_Gott_und_Herzlich_Willkommen_auf_meiner_Benutzer-Diskussionsseite"
>             },
>             {
>                 "toclevel": 2,
>                 "level": "3",
>                 "line": "Meine WP-W\u00fcnsche f\u00fcr 2011",
>                 "number": "1.1",
>                 "index": "",
>                 "fromtitle": false,
>                 "byteoffset": 7897,
>                 "anchor": "Meine_WP-W.C3.BCnsche_f.C3.BCr_2011"
>             },
> 
> Since this second one comes from within a template, the current parser can't
> really assign it a byte position within the article text. I'm not too familiar
> with how this output is generated so will have to take a peek to say more.
> Ideally it at least shouldn't mess up the later sections, but I'm not sure how
> a "byteoffset" helps if you don't have a "bytelength"... possibly this is just
> a bad data structure that's not really suitable for how sections are handled.
> :(

The point is that the byteoffset field should be "" in order to be recognized correctly, e.g. by DrTrigonBot. Look at [1]; there you have e.g.

 index="T-7" byteoffset=""

for all template entries, except the level 3 headings, where you get e.g.

 index="" byteoffset="137405"

which confuses my bot a little bit! My workaround is to catch the empty index string, but since this is considered to be a bug I cannot rely on the fact that there will always be an empty index string...
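
As a rough Python sketch of that workaround (not DrTrigonBot's actual code): `offset_is_trustworthy` is a hypothetical name, and the sample entries reuse field values quoted in this thread.

```python
def offset_is_trustworthy(entry):
    """Hypothetical check: can this section's byteoffset be used?

    byteoffset == "" : entry is already marked as coming from a template.
    index == ""      : workaround for this bug; the offset is present
                       but does not point at the heading.
    """
    return entry.get("byteoffset", "") != "" and entry.get("index", "") != ""

# Field values as quoted in comments 3 and 6 of this report.
sample = [
    {"index": "1", "byteoffset": 3417},    # regular section
    {"index": "", "byteoffset": 7897},     # affected by this bug
    {"index": "T-7", "byteoffset": ""},    # template entry, marked correctly
]
usable = [e for e in sample if offset_is_trustworthy(e)]
```

The empty-index test is exactly the part that cannot be relied on once the bug is fixed, so it is kept as a separate condition that can be dropped later.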

[1] http://de.wikipedia.org/w/api.php?action=parse&page=Wikipedia:L%C3%B6schkandidaten/12.%20Dezember%202009&prop=sections

Greetings
