Last modified: 2014-06-13 16:54:24 UTC
This breaks our XML parsing when we try to detect images for the saved pages feature in the Android Wikipedia app. Example from yesterday's main page: <div style="float:right;margin-left:0.5em;"><a href="/wiki/File:Maria_Sharapova,_December_2008.jpg" class="image" title="Maria Sharapova"> <img alt="Maria Sharapova in 2008" src="//upload.wikimedia.org/wikipedia/en/thumb/c/c6/Maria_Sharapova%2C_December_2008.jpg/61px-Maria_Sharapova%2C_December_2008.jpg" width="61" height="100" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/c/c6/Maria_Sharapova%2C_December_2008.jpg/91px-Maria_Sharapova%2C_December_2008.jpg 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/c/c6/Maria_Sharapova%2C_December_2008.jpg/121px-Maria_Sharapova%2C_December_2008.jpg 2x" data-file-width="405" data-file-height="667"></a></div> The img tag is never closed; for an XML parser it should end in data-file-height="667"/>
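To reproduce the failure outside the app, here is a minimal sketch. It assumes the JDK's DocumentBuilder as a stand-in, since the report does not say which XML parser the app uses. The class name XmlImgCheck and the shortened markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not code from the Android app.
class XmlImgCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Malformed input, e.g. an <img> that is never closed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shortened version of the main page markup: img left open (HTML5 style).
        String unclosed = "<div><a href=\"/wiki/File:a.jpg\"><img src=\"a.jpg\" width=\"61\"></a></div>";
        // Same markup with the img tag self-closed.
        String selfClosed = "<div><a href=\"/wiki/File:a.jpg\"><img src=\"a.jpg\" width=\"61\"/></a></div>";

        System.out.println(isWellFormedXml(unclosed));   // prints "false"
        System.out.println(isWellFormedXml(selfClosed)); // prints "true"
    }
}
```

The only difference between the two inputs is the trailing slash on the img tag, which matches the failure described above.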
Prioritization and scheduling of this bug is tracked on Trello card https://trello.com/c/XeXwx61H
We output HTML5, so the markup is not supposed to be valid XML. Are there any alternative parsers around?
Does adding closing tags prevent it from being HTML5 compliant?
Other pages have closed tags; only the main page is different.
Isn't this an issue with the parser? I think assuming the markup is XML is not very future-proof. Hopefully one day all of Wikipedia will be HTML5, which doesn't need closed tags.
Just to clarify, self-closing img tags are not necessary in HTML5, but they are allowed and considered valid: http://dev.w3.org/html5/spec-author-view/syntax.html#syntax-start-tag
Are you using libxml2 here? I think there should be an HTML parsing mode which groks the implied-closed img elements.
The main page is different because we rewrite it for mobile, but there's no guarantee that other pages will not be processed too, e.g. for image removal or Zero.
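To illustrate what an HTML parsing mode buys here, the sketch below uses the lenient HTML parser that ships with the desktop JDK (javax.swing.text.html) as a stand-in. It is not libxml2, and javax.swing is not available on Android, so treat it as an illustration only. The class name LenientImgScan and the sample markup are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects img src values with an HTML-aware parser.
class LenientImgScan {

    static List<String> imageSources(String html) {
        List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // img is an empty element in HTML, so it is reported as a
                // "simple" tag and no closing tag is expected.
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }

    public static void main(String[] args) {
        // The img tag is left open, as in the main page output.
        String html = "<div><a href=\"/wiki/File:a.jpg\"><img src=\"a.jpg\" width=\"61\"></a></div>";
        System.out.println(imageSources(html));
    }
}
```

Because the parser knows img takes no content, the unclosed tag is not an error and the src value can still be collected.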
I'll retract this issue from an Android app point of view since we switched away from using an XML parser. We are now using Html.fromHtml() with a custom Html.ImageGetter.