Last modified: 2014-10-30 14:22:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T74734, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 72734 - Unicode characters in API output
Status: RESOLVED WORKSFORME
Product: MediaWiki
Classification: Unclassified
Component: API
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-10-30 04:11 UTC by nejuje6tpztluvolq
Modified: 2014-10-30 14:22 UTC


Description nejuje6tpztluvolq 2014-10-30 04:11:07 UTC
Not sure if this is a bug but for example:

wget -q -O- "http://en.wikipedia.org/w/api.php?action=parse&page=abbé_Prévost&prop=links&format=json"

produces:

{"parse":{"title":"Abb\u00e9 Pr\u00e9vost","links":[{"ns":0,"*":"Antoine Fran\u00e7ois Pr\u00e9vost","exists":""}]}}

According to 
http://www.fileformat.info/info/unicode/char/e9/index.htm

...the "\u00e9" is the escape sequence that C/C++/Java source code uses for the Unicode character é. For an API client this means I need to translate it, and I don't have an easy way to do that. Should the API output the literal character é instead of the Java/C++/Python-style escape?

Regards,
GreenC
Comment 1 Brad Jorsch 2014-10-30 13:38:06 UTC
"\u00e9" is also produced by JavaScript and other ECMAScript implementations. Your JSON decoder should be handling it for you; if you're writing your own JSON decoder, it will need to handle such escapes.

That said, if you supply the utf8 option to format=json,[1] most characters will be returned unescaped. You will still see escapes for certain characters, though, such as double-quote and newline.


 [1]: http://en.wikipedia.org/w/api.php?action=parse&page=abbé_Prévost&prop=links&format=json&utf8=1
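
To illustrate the point about JSON decoders, here is a minimal sketch in Python (chosen only for illustration; the reporter's own tooling is awk). It feeds the response body quoted in the description to the standard `json` module, which resolves the `\u00e9` escapes without any extra translation step. The variable names are made up for this example.

```python
import json

# Response body copied from the bug description (default format=json output).
# A raw string keeps the backslashes exactly as the API sent them.
raw = r'{"parse":{"title":"Abb\u00e9 Pr\u00e9vost","links":[{"ns":0,"*":"Antoine Fran\u00e7ois Pr\u00e9vost","exists":""}]}}'

# json.loads resolves every \uXXXX escape while parsing.
data = json.loads(raw)

title = data["parse"]["title"]
first_link = data["parse"]["links"][0]["*"]

print(title == "Abbé Prévost")                  # True
print(first_link == "Antoine François Prévost") # True
```

The same decoding happens whether or not `utf8=1` is supplied, so a client that uses a real JSON parser should not need to care which form the API returns.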
Comment 2 Kevin Israel (PleaseStand) 2014-10-30 13:57:54 UTC
(In reply to Brad Jorsch from comment #1)
> "\u00e9" is also produced by JavaScript and other ECMAScript
> implementations.

Brad, I think you meant "correctly parsed" instead of "produced". JavaScript's JSON.stringify() won't escape that character.

> Your JSON decoder should be handling it for you; if you're
> writing your own JSON decoder, it will need to handle such escapes.

Yes, see <http://tools.ietf.org/html/rfc7159#section-7>.
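
As a rough illustration of what that section of the RFC permits, the sketch below uses Python's `json` module to serialize the same string both ways. The `ensure_ascii` switch is Python's own option; it is used here only as an analogy for the API's default output versus `utf8=1`, not as a description of how MediaWiki implements it.

```python
import json

title = "Abbé Prévost"

# Default: output is ASCII-only, so é is written as the escape \u00e9.
escaped = json.dumps(title)

# ensure_ascii=False: non-ASCII characters are written literally.
literal = json.dumps(title, ensure_ascii=False)

# Both texts are valid JSON and decode to the identical string.
same = json.loads(escaped) == json.loads(literal) == title
print(same)  # True
```

Since a conforming parser must accept either form, which one a serializer emits is a matter of taste or transport constraints.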
Comment 3 nejuje6tpztluvolq 2014-10-30 14:22:25 UTC
I'm using a language (awk) with no native UTF or JSON support, so I found that the output needs to be piped through the Unix utility iconv, e.g.

> echo '\u00E9' | iconv -f java
> é

However, &utf8=1 is awesome. That saved me from having to call the external program above.
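
For readers without iconv at hand, the translation step can be approximated in a few lines. This is a naive sketch in Python, not a replacement for a JSON parser: the function name `decode_u_escapes` is hypothetical, it assumes well-formed input, and it does not treat an escaped backslash followed by `u` specially.

```python
import re

# Matches one \uXXXX escape: a backslash, the letter u, and four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(text: str) -> str:
    # Step 1: turn every escape into the UTF-16 code unit it names.
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16 so that surrogate pairs (used for
    # characters above U+FFFF) are joined into a single code point.
    # A lone, unpaired surrogate makes the decode step raise an error.
    return units.encode("utf-16", "surrogatepass").decode("utf-16")

print(decode_u_escapes(r"Abb\u00e9 Pr\u00e9vost") == "Abbé Prévost")  # True
```

With `utf8=1` most of this becomes unnecessary, which matches the experience described here.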

The link to ietf.org is helpful. I tried it with the Wikipedia article named "300" (the title includes the quotes):

wget -q -O- "http://en.wikipedia.org/w/api.php?action=parse&page="'"'"300"'"'"&prop=links&format=json&utf8=1"

produces

{"parse":{"title":"\"300\"","links": etc..

So it escapes the quote with a plain backslash rather than in the \uXXXX (UTF-16, Java-style) format. That should make life easier.
 
GreenC
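
The behaviour observed with the "300" title can be reproduced with any JSON serializer. The following sketch uses Python's `json` module as a stand-in (an assumption for illustration; the API itself is not written in Python) to show that quotes and newlines keep their short backslash escapes even when non-ASCII escaping is switched off, and that the short and `\uXXXX` forms decode to the same string.

```python
import json

title = '"300"'

# Even with ensure_ascii=False, the double quotes must still be escaped,
# because an unescaped quote would end the JSON string early.
encoded = json.dumps(title, ensure_ascii=False)

# Newlines are likewise written as the two-character escape \n.
multiline = json.dumps("line one\nline two", ensure_ascii=False)

# The short escape \" and the long escape \u0022 name the same character.
long_form = r'"\u0022300\u0022"'
equivalent = json.loads(encoded) == json.loads(long_form) == title
print(equivalent)  # True
```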
