Last modified: 2014-10-20 15:26:58 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T74257, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 72257 - Unescaped characters in XML format response in recentchange API
Status: RESOLVED INVALID
Product: MediaWiki
Classification: Unclassified
Component: API
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2014-10-20 14:04 UTC by bianjiang
Modified: 2014-10-20 15:26 UTC (History)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
there is an example ~line56 (124.59 KB, text/xml)
2014-10-20 14:05 UTC, bianjiang

Description bianjiang 2014-10-20 14:04:22 UTC
We are fetching the recentchanges API with wget and parsing the response in XML format. Characters such as "&" and double quotes are not properly escaped in the response. This started about an hour ago.

Example API request below (it actually happens on all language sites).
(The attachment is a copy of a response; there is an example around line 56. You can also reproduce it by editing an article with "&" in its title.)


http://en.wikipedia.org/w/api.php?action=query&format=xml&continue=&list=recentchanges&rclimit=500&rcnamespace=0%7C2%7C4%7C6%7C10%7C14%7C100%7C118%7C828&rcprop=comment%7Cflags%7Cids%7Cloginfo%7Csizes%7Ctimestamp%7Ctitle%7Cuser

And a snippet from the response:
<rc type="edit" ns="10" title="Template:Infobox vanadium" pageid="10586283" revid="630373828" old_revid="619432401" rcid="687830391" user="DePiep" oldlen="3074" newlen="3043" timestamp="2014-10-20T13:47:28Z" comment="isotopes table:rm double linking, use en-dash for "not existing" (not hyphen), replaced: [[Positron emission → [[positron emission, above=- → above=– using [[Project:AWB|AWB]]"/>
<rc type="external" ns="0" title="Commiphora africana" pageid="37123832" revid="611837345" old_revid="611837345" rcid="687830417" user="EmausBot" anon="" bot="" minor="" oldlen="5607" newlen="5607" timestamp="2014-10-20T13:47:26Z" comment=""/>
<rc type="external" ns="14" title="Category:Cyberwarfare" pageid="28844523" revid="551240694" old_revid="551240694" rcid="687830416" user="Bjankuloski06" anon="" minor="" oldlen="191" newlen="191" timestamp="2014-10-20T13:47:26Z" comment=""/>

<rc type="external" ns="0" title="Outlaw Gentlemen & Shady Ladies" pageid="38424498" revid="629055833" old_revid="629055833" rcid="687830199" user="Dexbot" anon="" bot="" minor="" oldlen="20380" newlen="20380" timestamp="2014-10-20T13:45:37Z" comment=""/>
<rc type="external" ns="0" title="Dacryodes edulis" pageid="10635487" revid="620387604" old_revid="620387604" rcid="687830198" user="Lockal" anon="" minor="" oldlen="7899" newlen="7899" timestamp="2014-10-20T13:45:37Z" comment=""/>
<rc type="edit" ns="4" title="Wikipedia:Requests for mediation/Pending" pageid="5917449" revid="630373641" old_revid="630057593" rcid="687830139" user="MediationBot" oldlen="1201" newlen="1145" timestamp="2014-10-20T13:45:35Z" comment="Updating pending case list, 2 listed. Errors? [[User:MediationBot/shutoff/MedComClerk]]"/>
<rc type="edit" ns="0" title="Muhammad Nawaz Sharif University of Engineering & Technology, Multan" pageid="44157005" revid="630373640" old_revid="630371603" rcid="687830138" user="Umarkhaldoon" minor="" oldlen="4387" newlen="4441" timestamp="2014-10-20T13:45:35Z" comment="/* Campuses */"/>
Comment 1 bianjiang 2014-10-20 14:05:07 UTC
Created attachment 16817 [details]
there is an example ~line56
Comment 2 Brad Jorsch 2014-10-20 14:13:02 UTC
I cannot reproduce this; all data quotes, ampersands, and so on are escaped. I note that attachment 16817 [details] has obviously been post-processed to some extent, as the MediaWiki API does not include newlines between tags for format=xml.

If possible, please provide a link that reliably reproduces the issue, as list=recentchanges without rcstart/rcend is not going to be valid for more than a minute or two on enwiki. Can you do the same with e.g. https://en.wikipedia.org/w/api.php?format=xml&action=query&list=usercontribs&ucuser=DePiep&ucstart=2014-10-20T13:48:00Z&uclimit=1 ? Also, please upload the exact response you are receiving from the API, without any post-processing.
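A reproducible request can be built by pinning the time window with rcstart and rcend. The following is only a minimal sketch: the helper name recentchanges_url, the default limit, and the reduced rcprop list are assumptions for illustration, and the timestamps are the ones used later in this report.

```python
from urllib.parse import urlencode

def recentchanges_url(start, end, limit=500):
    """Build a list=recentchanges URL pinned to a fixed time window.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    params = {
        "action": "query",
        "format": "xml",
        "list": "recentchanges",
        "rcstart": start,   # newest timestamp to return
        "rcend": end,       # oldest timestamp to return
        "rclimit": limit,
        "rcprop": "comment|flags|ids|sizes|timestamp|title|user",
    }
    # urlencode percent-encodes ":" and "|" so the link survives copy/paste.
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

url = recentchanges_url("2014-10-20T13:26:16Z", "2014-10-20T13:26:16Z")
print(url)
```

Because the window is fixed, the same link should keep selecting the same range of changes, which is what makes the report checkable by someone else.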
Comment 4 Brad Jorsch 2014-10-20 14:23:02 UTC
After downloading that link, I see 53 instances of &#039;, 5 of &amp;, 3 of &gt;, 3 of &lt;, and 24 of &quot;. There are no other ampersands in the file.
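A count like this can be reproduced on a raw, unprocessed download. The sketch below is one possible way to do it, not the method used in this comment: the function name audit_ampersands, the regular expressions, and the sample string are all assumptions for illustration.

```python
import re
from collections import Counter

# An entity reference: named (&amp;), decimal (&#039;) or hex (&#x27;).
ENTITY = re.compile(r"&(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")
# An ampersand that does NOT start an entity reference.
BARE_AMP = re.compile(r"&(?!(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)")

def audit_ampersands(raw):
    """Return (counts per entity, number of unescaped ampersands)."""
    entities = Counter(m.group(0) for m in ENTITY.finditer(raw))
    bare = len(BARE_AMP.findall(raw))
    return entities, bare

# Illustrative sample only, not an actual API response.
sample = ('<rc title="Outlaw Gentlemen &amp; Shady Ladies" '
          'comment="&quot;x&quot; &#039;y&#039;"/>')
entities, bare = audit_ampersands(sample)
print(dict(entities), bare)
```

A properly escaped response should report zero bare ampersands; any nonzero count points at the exact kind of breakage described in this bug.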

I note that is another list=recentchanges link that does not use rcstart.
Comment 5 bianjiang 2014-10-20 14:31:23 UTC
you're right. not seeing the problem now. maybe a false alarm.

Thanks for debugging anyway!
Comment 6 anthonyzhang 2014-10-20 15:13:42 UTC
I can still see the buggy XML from this link:
http://en.wikipedia.org/w/api.php?action=query&format=xml&continue=&list=recentchanges&rcend=2014-10-20T13:26:16Z&rclimit=500&rcnamespace=0%7C2%7C4%7C6%7C10%7C14%7C100%7C118%7C828&rcprop=comment%7Cflags%7Cids%7Cloginfo%7Csizes%7Ctimestamp%7Ctitle%7Cuser&rcstart=2014-10-20T13:26:16Z
The returned XML is:
<api>
<query>
<recentchanges>
<rc type="external" ns="0" title="Monte Duida tree frog" pageid="12376224" revid="594167748" old_revid="594167748" rcid="687827550" user="Lockal" anon="" minor="" oldlen="1199" newlen="1199" timestamp="2014-10-20T13:26:16Z" comment=""/>
<rc type="edit" ns="0" title="Kuala Lumpur International Airport" pageid="105963" revid="630371813" old_revid="630328242" rcid="687827524" user="Egard89" minor="" oldlen="74091" newlen="73846" timestamp="2014-10-20T13:26:16Z" comment="Undid revision 630328242 by [[Special:Contributions/Mohamad Aliff Shafiq Azizan|Mohamad Aliff Shafiq Azizan]] ([[User talk:Mohamad Aliff Shafiq Azizan|talk]])"/>
<rc type="edit" ns="0" title="Arash "AJ" Maddah" pageid="43501773" revid="630371812" old_revid="624205432" rcid="687827523" user="Waacstats" oldlen="1183" newlen="1206" timestamp="2014-10-20T13:26:16Z" comment="/* References */Add persondata short description using [[Project:AWB|AWB]]"/>
</recentchanges>
</query>
</api>
The bug is:  title="Arash "AJ" Maddah" pageid="43501773"
Comment 7 Brad Jorsch 2014-10-20 15:17:27 UTC
(In reply to anthonyzhang from comment #6)
> The bug is:  title="Arash "AJ" Maddah" pageid="43501773"

Looks fine to me when following that link. Note that web browsers such as Firefox and Chrome decode at least some entities in their prettified rendering of XML documents; view-source or downloading the actual file with curl or wget and viewing in a text editor shows the proper encoded entities.
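The difference between the raw bytes and the decoded view can be checked without a browser. This is a small sketch using only the Python standard library; the helper is_well_formed and the hand-built rc element are assumptions for illustration, with the title taken from comment 6.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

title = 'Arash "AJ" Maddah'

# What a correctly escaping server puts on the wire for this attribute.
raw = '<rc title="%s"/>' % escape(title, {'"': "&quot;"})

# What a parser (or a browser's prettified view) shows after decoding.
decoded = ET.fromstring(raw).get("title")

# The text as pasted in comment 6 would not parse if it were really sent.
pasted = '<rc title="Arash "AJ" Maddah"/>'

print(raw)
print(decoded)
print(is_well_formed(raw), is_well_formed(pasted))
```

The raw string carries &quot; while the decoded value shows plain quotes, which matches the explanation above: text copied from a rendered view looks unescaped even though the response itself is well-formed.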
Comment 8 anthonyzhang 2014-10-20 15:26:58 UTC
Yes, you are right. Thanks for the explanation!
