Last modified: 2014-10-30 14:22:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T74734, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 72734 - Unicode characters in API output


Summary:	Unicode characters in API output

Status:	RESOLVED WORKSFORME

Product:	MediaWiki
Classification:	Unclassified
Component:	API
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized normal
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-10-30 04:11 UTC by nejuje6tpztluvolq
Modified:	2014-10-30 14:22 UTC
CC List:	5 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description nejuje6tpztluvolq 2014-10-30 04:11:07 UTC

Not sure if this is a bug but for example:

wget -q -O- "http://en.wikipedia.org/w/api.php?action=parse&page=abbé_Prévost&prop=links&format=json"

produces:

{"parse":{"title":"Abb\u00e9 Pr\u00e9vost","links":[{"ns":0,"*":"Antoine Fran\u00e7ois Pr\u00e9vost","exists":""}]}}

According to http://www.fileformat.info/info/unicode/char/e9/index.htm, "\u00e9" is the C/C++/Java source-code escape for the Unicode character é. As an API user this means I have to translate it, and I don't have an easy way to do that. Should the API output the character é itself rather than the Java/C++/Python-style escape?

Regards,
GreenC

Comment 1 Brad Jorsch 2014-10-30 13:38:06 UTC

"\u00e9" is also produced by JavaScript and other ECMAScript implementations. Your JSON decoder should be handling it for you; if you're writing your own JSON decoder, it will need to handle such escapes.
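As a rough illustration of "your JSON decoder should be handling it for you", here is a minimal Python sketch that feeds the response body quoted in the description to the standard-library json module. The variable names are made up for this example; only the response text comes from the report.

```python
import json

# Response body copied from the bug description (raw string, so the
# backslash escapes reach the JSON decoder untouched).
raw = r'{"parse":{"title":"Abb\u00e9 Pr\u00e9vost","links":[{"ns":0,"*":"Antoine Fran\u00e7ois Pr\u00e9vost","exists":""}]}}'

data = json.loads(raw)

# The decoder turns \u00e9 and \u00e7 into the real characters.
title = data["parse"]["title"]
link_targets = [link["*"] for link in data["parse"]["links"]]

print(title == "Abbé Prévost")
print(link_targets == ["Antoine François Prévost"])
```

No separate translation step is needed once the text has passed through a JSON decoder.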

That said, if you supply the utf8 option to format=json,[1] most characters will be returned unescaped. You will still see escapes for certain characters, though, such as double-quote and newline.


 [1]: http://en.wikipedia.org/w/api.php?action=parse&page=abbé_Prévost&prop=links&format=json&utf8=1
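The difference between the two output styles can be sketched with Python's own JSON serializer. This is only an analogy: ensure_ascii is a Python option, not the MediaWiki implementation, and the sample string is hypothetical, chosen to contain é, a double-quote and a newline.

```python
import json

# Hypothetical sample value containing é, double-quotes and a newline.
title = 'Abbé "Prévost"\n'

# ASCII-only output, comparable to format=json without utf8.
escaped = json.dumps(title)

# Non-ASCII characters kept as-is, comparable to format=json&utf8=1.
unescaped = json.dumps(title, ensure_ascii=False)

# Double-quote and newline are still escaped in both forms,
# because JSON strings cannot contain them literally.
still_escaped = '\\"' in unescaped and "\\n" in unescaped

# Both forms decode to the same string.
same_value = json.loads(escaped) == json.loads(unescaped) == title
print(still_escaped, same_value)
```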

Comment 2 Kevin Israel (PleaseStand) 2014-10-30 13:57:54 UTC

(In reply to Brad Jorsch from comment #1)
> "\u00e9" is also produced by JavaScript and other ECMAScript
> implementations.

Brad, I think you meant "correctly parsed" instead of "produced". JavaScript's JSON.stringify() won't escape that character.

> Your JSON decoder should be handling it for you; if you're
> writing your own JSON decoder, it will need to handle such escapes.

Yes, see <http://tools.ietf.org/html/rfc7159#section-7>.
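For anyone checking a decoder against that section, a small Python sketch of the escape forms it describes may help: \uXXXX escapes, the two-character escapes, and a surrogate pair for a character outside the Basic Multilingual Plane (the G clef, U+1D11E, is the example used in the RFC). The test strings below are made up for illustration.

```python
import json

# \uXXXX escape for a character in the Basic Multilingual Plane.
e_acute = json.loads(r'"\u00e9"')

# Two-character escapes: quote, backslash, solidus, newline, tab.
short_escapes = json.loads(r'"\" \\ \/ \n \t"')

# A character outside the BMP is written as a UTF-16 surrogate pair;
# the decoder combines the pair into one code point.
g_clef = json.loads(r'"\ud834\udd1e"')

print(e_acute == "é")
print(short_escapes == '" \\ / \n \t')
print(g_clef == "\U0001D11E", len(g_clef))
```

A hand-written decoder that handles only single \uXXXX escapes would get the last case wrong.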

Comment 3 nejuje6tpztluvolq 2014-10-30 14:22:25 UTC

I'm using a language (awk) with no native UTF or JSON support, so I found that I need to pipe the output through the Unix utility iconv, e.g.:

> echo '\u00E9' | iconv -f java
> é

However, &utf8=1 is awesome. It saves me from calling the external program above.

The link to ietf.org is helpful. I tried it with the Wikipedia article named "300" (the name includes the quotes):

wget -q -O- "http://en.wikipedia.org/w/api.php?action=parse&page="'"'"300"'"'"&prop=links&format=json&utf8=1"

produces

{"parse":{"title":"\"300\"","links": etc..

So the quote is escaped with a plain backslash rather than in the \uXXXX (UTF-16, Java-style) form. That should make life easier.
 
GreenC
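The "300" request above can also be sketched in Python, which avoids the nested shell quoting. This is a minimal sketch with no network access: it only builds the URL using the parameters from the wget command, then decodes the title fragment shown in the output. The variable names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Same parameters as the wget command; urlencode percent-encodes the
# double-quotes in the page name, so no shell quoting tricks are needed.
params = {
    "action": "parse",
    "page": '"300"',
    "prop": "links",
    "format": "json",
    "utf8": "1",
}
url = "http://en.wikipedia.org/w/api.php?" + urlencode(params)

# The title as it appears in the JSON output: quotes escaped with a backslash.
title = json.loads(r'"\"300\""')

print(url)
print(title)
```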
