Last modified: 2014-10-30 14:22:25 UTC
Not sure if this is a bug, but for example:

wget -q -O- "http://en.wikipedia.org/w/api.php?action=parse&page=abbé_Prévost&prop=links&format=json"

produces:

{"parse":{"title":"Abb\u00e9 Pr\u00e9vost","links":[{"ns":0,"*":"Antoine Fran\u00e7ois Pr\u00e9vost","exists":""}]}}

According to http://www.fileformat.info/info/unicode/char/e9/index.htm, "\u00e9" is the C/C++/Java source-code escape for the Unicode character é. For an API consumer this means I need to translate it, and I don't have an easy way to do that. Should the API produce the character é itself rather than the Java/C++/Python-style escape?

Regards, GreenC
"\u00e9" is also produced by JavaScript and other ECMAScript implementations. Your JSON decoder should be handling it for you; if you're writing your own JSON decoder, it will need to handle such escapes. That said, if you supply the utf8 option to format=json,[1] most characters will be returned unescaped. You will still see escapes for certain characters, though, such as double-quote and newline. [1]: http://en.wikipedia.org/w/api.php?action=parse&page=abbé_Prévost&prop=links&format=json&utf8=1
(In reply to Brad Jorsch from comment #1)
> "\u00e9" is also produced by JavaScript and other ECMAScript
> implementations.

Brad, I think you meant "correctly parsed" instead of "produced". JavaScript's JSON.stringify() won't escape that character.

> Your JSON decoder should be handling it for you; if you're
> writing your own JSON decoder, it will need to handle such escapes.

Yes, see <http://tools.ietf.org/html/rfc7159#section-7>.
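To illustrate the point about decoders, here is a minimal sketch in Python (Python is not used anywhere in this report; it is chosen only for illustration). The input is the escaped response quoted in the description. A conforming JSON decoder should turn "\u00e9" back into é on its own, so no separate translation step is needed:

```python
import json

# Escaped response body copied from the bug description.
# The raw-string prefix keeps the backslashes literal, as they arrive on the wire.
raw = r'{"parse":{"title":"Abb\u00e9 Pr\u00e9vost","links":[{"ns":0,"*":"Antoine Fran\u00e7ois Pr\u00e9vost","exists":""}]}}'

data = json.loads(raw)

# The decoder has already resolved every \uXXXX escape.
title = data["parse"]["title"]
first_link = data["parse"]["links"][0]["*"]

print(title == "Abbé Prévost")                  # True
print(first_link == "Antoine François Prévost") # True
```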
I'm using a language (awk) with no native Unicode or JSON support, so I found I needed to pipe the output through the Unix utility iconv, e.g.:

> echo '\u00E9' | iconv -f java
> é

However, &utf8=1 is awesome; it saves me from calling the external program. The link to ietf.org is helpful.

I tried it with the Wikipedia article named "300" (the title includes the quotes):

wget -q -O- "http://en.wikipedia.org/w/api.php?action=parse&page="'"'"300"'"'"&prop=links&format=json&utf8=1"

produces

{"parse":{"title":"\"300\"","links": etc..

So the quote is escaped with a plain backslash rather than in the \uXXXX (UTF-16, Java-style) format. That should make life easier.

GreenC
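The behaviour seen with utf8=1 matches what JSON itself requires. Inside a string, a double quote, backslash, or control character must be escaped even when non-ASCII characters are written out as-is. A rough analogy in Python follows. Here `ensure_ascii=False` only stands in for the API's utf8 option; that pairing is an assumption made for illustration, not a description of how MediaWiki implements it:

```python
import json

name = "Abbé Prévost"
quoted_title = '"300"'  # a title that itself contains double quotes

# Default output: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps({"title": name})

# Analogue of utf8=1 (assumption): non-ASCII characters are kept as-is.
unescaped = json.dumps({"title": name}, ensure_ascii=False)

# Double quotes are still backslash-escaped, because JSON syntax requires it.
quoted = json.dumps({"title": quoted_title}, ensure_ascii=False)

# Both encodings decode to the same value.
print(json.loads(escaped) == json.loads(unescaped))  # True
print(quoted)  # {"title": "\"300\""}
```

Either form is valid JSON, so a consumer with a real JSON decoder does not need to care which one the server sends.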