Last modified: 2014-10-20 15:26:58 UTC
We are fetching the recentchanges API with wget and parsing the response as XML. Characters such as "&" and the double quote are not properly escaped in the response. This started about 1 hour ago.
Example API request (it actually happens on all language sites; the attachment is a copy of a response, with an example around line 56; you can also reproduce it by editing an article with "&" in its title):
http://en.wikipedia.org/w/api.php?action=query&format=xml&continue=&list=recentchanges&rclimit=500&rcnamespace=0%7C2%7C4%7C6%7C10%7C14%7C100%7C118%7C828&rcprop=comment%7Cflags%7Cids%7Cloginfo%7Csizes%7Ctimestamp%7Ctitle%7Cuser
Snippet from the response:
<rc type="edit" ns="10" title="Template:Infobox vanadium" pageid="10586283" revid="630373828" old_revid="619432401" rcid="687830391" user="DePiep" oldlen="3074" newlen="3043" timestamp="2014-10-20T13:47:28Z" comment="isotopes table:rm double linking, use en-dash for "not existing" (not hyphen), replaced: [[Positron emission → [[positron emission, above=- → above=– using [[Project:AWB|AWB]]"/>
<rc type="external" ns="0" title="Commiphora africana" pageid="37123832" revid="611837345" old_revid="611837345" rcid="687830417" user="EmausBot" anon="" bot="" minor="" oldlen="5607" newlen="5607" timestamp="2014-10-20T13:47:26Z" comment=""/>
<rc type="external" ns="14" title="Category:Cyberwarfare" pageid="28844523" revid="551240694" old_revid="551240694" rcid="687830416" user="Bjankuloski06" anon="" minor="" oldlen="191" newlen="191" timestamp="2014-10-20T13:47:26Z" comment=""/>
<rc type="external" ns="0" title="Outlaw Gentlemen & Shady Ladies" pageid="38424498" revid="629055833" old_revid="629055833" rcid="687830199" user="Dexbot" anon="" bot="" minor="" oldlen="20380" newlen="20380" timestamp="2014-10-20T13:45:37Z" comment=""/>
<rc type="external" ns="0" title="Dacryodes edulis" pageid="10635487" revid="620387604" old_revid="620387604" rcid="687830198" user="Lockal" anon="" minor="" oldlen="7899" newlen="7899" timestamp="2014-10-20T13:45:37Z" comment=""/>
<rc type="edit" ns="4" title="Wikipedia:Requests for mediation/Pending" pageid="5917449" revid="630373641" old_revid="630057593" rcid="687830139" user="MediationBot" oldlen="1201" newlen="1145" timestamp="2014-10-20T13:45:35Z" comment="Updating pending case list, 2 listed. Errors? [[User:MediationBot/shutoff/MedComClerk]]"/>
<rc type="edit" ns="0" title="Muhammad Nawaz Sharif University of Engineering & Technology, Multan" pageid="44157005" revid="630373640" old_revid="630371603" rcid="687830138" user="Umarkhaldoon" minor="" oldlen="4387" newlen="4441" timestamp="2014-10-20T13:45:35Z" comment="/* Campuses */"/>
Created attachment 16817: there is an example around line 56.
I cannot reproduce this; all data quotes, ampersands, and so on are escaped. I note that attachment 16817 has obviously been post-processed to some extent, as the MediaWiki API does not include newlines between tags for format=xml. If possible, please provide a link that reliably reproduces the issue, as list=recentchanges without rcstart/rcend is not going to be valid for more than a minute or two on enwiki. Can you do the same with e.g. https://en.wikipedia.org/w/api.php?format=xml&action=query&list=usercontribs&ucuser=DePiep&ucstart=2014-10-20T13:48:00Z&uclimit=1 ? Also, please upload the exact response you are receiving from the API, without any post-processing.
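As a rough illustration of the request for a reproducible link, the sketch below builds a list=recentchanges URL that is pinned with rcstart/rcend. The helper name `pinned_recentchanges_url`, the timestamps, and the shortened rcprop list are assumptions made for this example; they are not taken from the report.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def pinned_recentchanges_url(start, end, limit=500):
    """Build a recentchanges query bounded by rcstart and rcend.

    Hypothetical helper: with the default direction the API lists
    newest changes first, so `start` is the newer timestamp.
    """
    params = {
        "action": "query",
        "format": "xml",
        "list": "recentchanges",
        "rcstart": start,
        "rcend": end,
        "rclimit": limit,
        "rcprop": "comment|title|user|timestamp",
    }
    # urlencode percent-encodes ":" and "|" in the parameter values.
    return API + "?" + urlencode(params)


url = pinned_recentchanges_url("2014-10-20T13:48:00Z", "2014-10-20T13:47:00Z")
print(url)
```

Because both ends of the window are fixed, the same URL should keep returning the same entries for as long as the wiki retains them, which makes the response easier to compare between reporters.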
http://en.wikipedia.org/w/api.php?action=query&format=xml&continue=&list=recentchanges&rclimit=500&rcnamespace=0%7C2%7C4%7C6%7C10%7C14%7C100%7C118%7C828&rcprop=comment%7Cflags%7Cids%7Cloginfo%7Csizes%7Ctimestamp%7Ctitle%7Cuser&ucstart=2014-10-20T14:12:38Z
Search for "&" in the response.
After downloading that link, I see 53 instances of &#039;, 5 of &amp;, 3 of &gt;, 3 of &lt;, and 24 of &quot;. There are no other ampersands in the file. I note that this is another list=recentchanges link that does not use rcstart.
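The kind of count described above can be sketched as follows, assuming the raw response has already been saved and read into a string. The function name `entity_report` and the sample string are illustrative assumptions, not part of the original investigation.

```python
import re
from collections import Counter

# A well-formed reference: named (&amp;), decimal (&#039;) or hex (&#x27;).
ENTITY = re.compile(r"&(?:#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")
# An ampersand that does not start such a reference.
BARE_AMP = re.compile(r"&(?!#[0-9]+;|#x[0-9A-Fa-f]+;|[A-Za-z][A-Za-z0-9]*;)")


def entity_report(raw):
    """Return (counts per entity, number of bare ampersands) for raw text."""
    counts = Counter(m.group(0) for m in ENTITY.finditer(raw))
    bare = len(BARE_AMP.findall(raw))
    return counts, bare


# Made-up sample line in the style of the API output, properly escaped.
sample = (
    '<rc title="Outlaw Gentlemen &amp; Shady Ladies" '
    'comment="use &quot;x&quot; and &#039;y&#039;"/>'
)
counts, bare = entity_report(sample)
print(dict(counts), bare)
```

A bare-ampersand count of zero means every "&" in the file starts an entity reference, which is what a correctly escaped response should show.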
You're right. I'm not seeing the problem now; maybe it was a false alarm. Thanks for debugging anyway!
I can still see the buggy XML from this link:
http://en.wikipedia.org/w/api.php?action=query&format=xml&continue=&list=recentchanges&rcend=2014-10-20T13:26:16Z&rclimit=500&rcnamespace=0%7C2%7C4%7C6%7C10%7C14%7C100%7C118%7C828&rcprop=comment%7Cflags%7Cids%7Cloginfo%7Csizes%7Ctimestamp%7Ctitle%7Cuser&rcstart=2014-10-20T13:26:16Z
The returned XML is:
<api>
<query>
<recentchanges>
<rc type="external" ns="0" title="Monte Duida tree frog" pageid="12376224" revid="594167748" old_revid="594167748" rcid="687827550" user="Lockal" anon="" minor="" oldlen="1199" newlen="1199" timestamp="2014-10-20T13:26:16Z" comment=""/>
<rc type="edit" ns="0" title="Kuala Lumpur International Airport" pageid="105963" revid="630371813" old_revid="630328242" rcid="687827524" user="Egard89" minor="" oldlen="74091" newlen="73846" timestamp="2014-10-20T13:26:16Z" comment="Undid revision 630328242 by [[Special:Contributions/Mohamad Aliff Shafiq Azizan|Mohamad Aliff Shafiq Azizan]] ([[User talk:Mohamad Aliff Shafiq Azizan|talk]])"/>
<rc type="edit" ns="0" title="Arash "AJ" Maddah" pageid="43501773" revid="630371812" old_revid="624205432" rcid="687827523" user="Waacstats" oldlen="1183" newlen="1206" timestamp="2014-10-20T13:26:16Z" comment="/* References */Add persondata short description using [[Project:AWB|AWB]]"/>
</recentchanges>
</query>
</api>
The bug is: title="Arash "AJ" Maddah" pageid="43501773"
(In reply to anthonyzhang from comment #6)
> The bug is: title="Arash "AJ" Maddah" pageid="43501773"
Looks fine to me when following that link. Note that web browsers such as Firefox and Chrome decode at least some entities in their prettified rendering of XML documents; view-source or downloading the actual file with curl or wget and viewing in a text editor shows the proper encoded entities.
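One way to check the raw bytes instead of trusting a browser's rendering is to feed them to an XML parser. The sketch below is a minimal example under stated assumptions: the helper name `check_raw_xml` and the two shortened sample documents are made up for illustration. It parses a properly escaped title and the unescaped form quoted in the comment above.

```python
import xml.etree.ElementTree as ET


def check_raw_xml(raw):
    """Parse raw response bytes; return (ok, titles or error message)."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as err:
        return False, str(err)
    # Attribute values come back decoded, e.g. &quot; becomes ".
    return True, [rc.get("title") for rc in root.iter("rc")]


# What a correctly escaped response would contain (shortened sample).
escaped = (
    b"<api><query><recentchanges>"
    b'<rc title="Arash &quot;AJ&quot; Maddah" pageid="43501773"/>'
    b"</recentchanges></query></api>"
)
# What a response would look like if the quotes were really unescaped.
unescaped = (
    b"<api><query><recentchanges>"
    b'<rc title="Arash "AJ" Maddah" pageid="43501773"/>'
    b"</recentchanges></query></api>"
)

print(check_raw_xml(escaped))
print(check_raw_xml(unescaped))
```

If the downloaded file parses and the decoded title contains the quotes, the escaping on the wire was correct, and the unescaped text seen in the browser came from its rendering.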
Yes, you are right. Thanks for the explanation!