
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T69410, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 67410 - UnicodeDecodeError in reflinks.py
Status: RESOLVED FIXED
Product: Pywikibot
Classification: Unclassified
Component: General (Other open bugs)
Version: core (2.0)
Hardware/OS: All / All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Pywikipedia bugs
Depends on:
Blocks:

Reported: 2014-07-02 10:27 UTC by Beta16
Modified: 2014-08-20 12:09 UTC
CC: 5 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments

Description Beta16 2014-07-02 10:27:51 UTC
I received an error during execution of the script:

python pwb.py reflinks.py "-xml:itwiki-20140612-pages-meta-current.xml.bz2"


Traceback (most recent call last):
  File "pwb.py", line 153, in <module>
    run_python_file(fn, argv, argvu)
  File "pwb.py", line 67, in run_python_file
    exec(compile(source, filename, "exec"), main_mod.__dict__)
  File "scripts/reflinks.py", line 824, in <module>
    main()
  File "scripts/reflinks.py", line 821, in main
    bot.run()
  File "scripts/reflinks.py", line 691, in run
    ref.transform()
  File "scripts/reflinks.py", line 236, in transform
    self.title = pywikibot.html2unicode(self.title)
  File "/data/project/betabot/core/pywikibot/page.py", line 3632, in html2unicode
    result += text
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)
Dropped throttle(s).
<type 'exceptions.UnicodeDecodeError'>

My version is:
Pywikibot: [https] r-pywikibot-core.git (e563873, g3466, 2014/07/02, 08:29:01, ok)
Release version: 2.0b1
Python: 2.7.3 (default, Feb 27 2014, 19:58:35)
[GCC 4.6.3]
unicode test: ok
Comment 1 xqt 2014-07-02 14:04:37 UTC
Do you have any hints about the page title?
Comment 2 Beta16 2014-07-02 16:23:58 UTC
Sorry, the page is [[w:it:Dolomiti]]
Comment 3 xqt 2014-07-02 16:57:01 UTC
Where is that character in the title?
>>> print unichr(0xe2)
â
Comment 4 Merlijn van Deen (test) 2014-07-02 17:08:14 UTC
import pywikibot
pywikibot.html2unicode('\xe2')


The issue is reflinks feeding non-unicode (i.e. str) data to html2unicode. It may be an issue with the XML parsing, or with the XML files provided by the WMF.
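
A minimal Python 2 sketch (illustrative only, not the reflinks code) of why concatenating such a str onto a unicode result raises exactly this error; the sample bytes are assumed to be UTF-8:

# Python 2: mixing a non-ASCII byte string with unicode triggers an
# implicit ASCII decode, which is what the traceback above shows.
raw = 'nell\xe2\x80\x99Enciclopedia'   # UTF-8 bytes of a right single quote
result = u''                           # html2unicode accumulates unicode

try:
    result += raw                      # implicit raw.decode('ascii')
except UnicodeDecodeError as err:
    print(err)                         # 'ascii' codec can't decode byte 0xe2 ...

result += raw.decode('utf-8')          # decoding first makes the += safe
print(repr(result))                    # u'nell\u2019Enciclopedia'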
Comment 5 xqt 2014-07-02 20:15:39 UTC
Then unicode2html(html2unicode(text)) should solve it.
Comment 6 Beta16 2014-07-07 13:38:30 UTC
The web page that is causing the issue is http://www.treccani.it/enciclopedia/mugo .
It is probably an encoding issue between UTF-8 and ISO-8859-1/windows-1252, because with the proposed correction in comment #5 (if I understand correctly) "Mugo nell’Enciclopedia Treccani" becomes "Mugo nellâ€™Enciclopedia Treccani".
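
A short sketch of that mojibake mechanism (for illustration only): the UTF-8 bytes of the right single quote, re-read as windows-1252, come out as "â€™".

# UTF-8 bytes wrongly re-decoded as windows-1252 produce the garbled title.
title = u'Mugo nell\u2019Enciclopedia Treccani'
utf8_bytes = title.encode('utf-8')            # correct UTF-8 representation
mangled = utf8_bytes.decode('windows-1252')   # the wrong decode step
print(repr(mangled))   # u'Mugo nell\xe2\u20ac\u2122Enciclopedia Treccani'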
Comment 7 Gerrit Notification Bot 2014-07-09 13:28:20 UTC
Change 144969 had a related patch set uploaded by Beta16:
reflinks.py - UnicodeDecodeError

https://gerrit.wikimedia.org/r/144969
Comment 8 John Mark Vandenberg 2014-07-09 14:52:38 UTC
I checked the URL http://www.treccani.it/enciclopedia/mugo with iconv, and it is OK.
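
The same kind of check can be done from Python (a sketch, assuming the page advertises UTF-8): if the fetched bytes decode cleanly, the page itself is well-formed UTF-8 and the problem is on the consuming side.

import urllib2

data = urllib2.urlopen('http://www.treccani.it/enciclopedia/mugo/').read()
try:
    data.decode('utf-8')
    print('page is valid UTF-8')
except UnicodeDecodeError as err:
    print('invalid UTF-8: %s' % err)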
Comment 9 John Mark Vandenberg 2014-07-13 00:30:13 UTC
(In reply to Merlijn van Deen from comment #4)
> import pywikibot
> pywikibot.html2unicode('\xe2')
> 
> 
> The issue is reflinks feeding non-unicode (i.e. str) data to html2unicode.

Stated another way: it is giving html2unicode a str which contains undecoded UTF-8.

In this case it is pretty normal HTML loading.

>>> pywikibot.html2unicode(pywikibot.comms.http.request(site=None, uri='http://www.treccani.it/enciclopedia/mugo/'))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/jvanden3/projects/pywiki/gerrit/pywikibot/page.py", line 3625, in html2unicode
    result += text[:match.start()]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 414: ordinal not in range(128)

This happens when it is trying to decode &nbsp; in the HTML.

Adding text=text.decode('utf-8') at the top solves this specific problem, because the HTML of that URL contains:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Of course it would need to be wrapped in a try: except: block.
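
Roughly, such a guard at the top of html2unicode might look like this (a sketch only, not an actual patch; the parameter names are illustrative):

def html2unicode(text, ignore=None):
    # illustrative signature; names are not taken from page.py
    if isinstance(text, str):          # Python 2 byte string
        try:
            text = text.decode('utf-8')
        except UnicodeDecodeError:
            pass                       # leave undecodable bytes to the caller
    # ... the existing entity-decoding logic would continue with unicode text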

Decoding 'text' is a reasonable expectation for a method called html2unicode ;-( but how much of the HTML soup should html2unicode understand, especially when it sits in the 'page' module?

If we're going to support proper deciphering of HTML in pywikibot (or do we have something like this already?), rather than expect each script to do it, the functionality should be in weblib, and it wouldn't hurt to use an existing library to do the grunt work. However, the 'ignore' list capability in html2unicode, which is needed by cosmetic_changes.py, is unlikely to be part of existing libraries.

The immediate solution is for reflinks.py to decode('utf-8') the fetched page before sending it to html2unicode, so I have -1'd the patch.
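
In other words, something along these lines in reflinks.py (a sketch of the suggested approach, not the merged change; the request call mirrors the one above):

import pywikibot
from pywikibot.comms import http

raw = http.request(site=None, uri='http://www.treccani.it/enciclopedia/mugo/')
if isinstance(raw, str):               # Python 2: still undecoded bytes
    try:
        raw = raw.decode('utf-8')
    except UnicodeDecodeError:
        raw = raw.decode('utf-8', 'replace')   # or skip this reference entirely
title = pywikibot.html2unicode(raw)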
Comment 10 Gerrit Notification Bot 2014-07-17 11:14:34 UTC
Change 144969 merged by jenkins-bot:
reflinks.py - UnicodeDecodeError

https://gerrit.wikimedia.org/r/144969
Comment 11 Amir Ladsgroup 2014-07-26 14:29:19 UTC
The patch is merged, so is this still valid? May I close the bug?
