Last modified: 2013-11-09 21:39:37 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T57186, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 55186 - archivebot.py doesn't support unicode month names
Status: RESOLVED FIXED
Product: Pywikibot
Classification: Unclassified
Component: General
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Pywikipedia bugs
Reported: 2013-10-05 04:39 UTC by Kunal Mehta (Legoktm)
Modified: 2013-11-09 21:39 UTC (History)

Description Kunal Mehta (Legoktm) 2013-10-05 04:39:35 UTC
Originally from: http://sourceforge.net/p/pywikipediabot/bugs/1482/
Reported by: Anonymous user
Created on: 2012-06-30 17:50:30
Subject: archivebot.py doesn't support unicode month names
Original description:
archivebot.py doesn't work well with languages such as Turkish, which has some month names containing non-ASCII (Unicode) characters. Namely:

2 Şubat
4 Mayıs
8 Ağustos
9 Eylül
11 Kasım
12 Aralık
Comment 1 Kunal Mehta (Legoktm) 2013-10-05 04:39:37 UTC
Pywikipedia [http] trunk/pywikipedia (r10432, 2012/06/30, 15:47:55)
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]
config-settings:
use_api = True
use_api_login = True
unicode test: ok
Comment 2 Kunal Mehta (Legoktm) 2013-10-05 04:39:39 UTC
Command line I used was archivebot.py -l turkish Archive/config
Comment 3 Kunal Mehta (Legoktm) 2013-10-05 04:39:40 UTC
Could you give us a traceback or further information about this bug? The bot uses the month names coming from MediaWiki messages, and I don't know what significance the locale setting has. Could you try to run the bot without the --locale=tr setting?
Comment 4 Kunal Mehta (Legoktm) 2013-10-05 04:39:42 UTC
Sure. There is no traceback for me to provide, though, since the code does run; it just ignores some threads.

Run1: archivebot.py -l turkish Archive/config
Fetching template transclusions...
Getting references to [[Sablon:Archive/config]] via API...
Processing [[tr:Kullanici mesaj:??????]]
3 Threads found on [[tr:Kullanici mesaj:??????]]
Looking for: {{Archive/config}} in [[tr:Kullanici mesaj:??????]]
Processing 3 threads
There are only 0 Threads. Skipping

Run2: archivebot.py Archive/config
Fetching template transclusions...
Getting references to [[Sablon:Archive/config]] via API...
Processing [[tr:Kullanici mesaj:??????]]
3 Threads found on [[tr:Kullanici mesaj:??????]]
Looking for: {{Archive/config}} in [[tr:Kullanici mesaj:??????]]
Processing 3 threads
There are only 0 Threads. Skipping

Note that the Turkish character ı is displayed as i in the CMD window (I run the code on Windows). The ?????? stands for my user talk page http://tr.wikipedia.org/wiki/Kullan%C4%B1c%C4%B1_mesaj:%E3%81%A8%E3%81%82%E3%82%8B%E7%99%BD%E3%81%84%E7%8C%AB but CMD cannot display Unicode.
Comment 5 Kunal Mehta (Legoktm) 2013-10-05 04:39:44 UTC
Oh, when I ran the bot initially without -l turkish, it ignored all threads. Since it has already archived 3 of the 6 initial threads, it still reports 0 threads, as it cannot see the ones with the "Mayıs" month name.
Comment 6 Kunal Mehta (Legoktm) 2013-10-05 04:39:46 UTC
Looked into this a bit.

I've managed to isolate the problem to ~line 237 where all the txt2timestamp functions are. It seems that all of them are raising ValueErrors.
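
For illustration only, here is a minimal sketch of how a ValueError of this kind can arise. It is not the archivebot.py code: `try_parse` is a hypothetical helper, the format string is made up, and the sketch assumes the process is running with the default "C" time locale, under which `time.strptime` only knows English month names.

```python
import time


def try_parse(text, fmt="%d %B %Y"):
    """Hypothetical stand-in for a txt2timestamp-style helper.

    Returns a time.struct_time, or None if strptime rejects the text.
    Assumes the default "C" time locale (English month names only).
    """
    try:
        return time.strptime(text, fmt)
    except ValueError:
        # strptime raises ValueError when the month name does not
        # match any name known to the active locale.
        return None


english = try_parse("2 February 2012")
turkish = try_parse("2 Şubat 2012")
print(english is not None, turkish is None)
```

If every parsing attempt fails this way, a caller that silently skips unparseable timestamps would end up ignoring the affected threads, which would be consistent with the "0 Threads" output reported above.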
Comment 7 Kunal Mehta (Legoktm) 2013-10-05 04:39:47 UTC
Tried this:
import unicodedata

@line 237
_TM = ''.join((c for c in unicodedata.normalize('NFD', TM.group(0)) if unicodedata.category(c) != 'Mn'))

and then call txt2timestamp with _TM instead of TM.group(0)
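
The normalization attempted above can be tried in isolation. The following is a minimal, self-contained sketch in which `strip_combining_marks` is a hypothetical name and not a function from archivebot.py. It decomposes each character (NFD) and drops the combining marks (Unicode category Mn). One caveat: the Turkish dotless ı (U+0131) has no canonical decomposition, so names such as "Mayıs" pass through unchanged; whether that matters for the later parsing step is not established in this report.

```python
import unicodedata


def strip_combining_marks(text):
    """Remove combining marks (category 'Mn') after NFD decomposition.

    Hypothetical helper mirroring the one-liner tried in this comment.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Month names taken from the original description of this bug.
for name in ["Şubat", "Mayıs", "Ağustos", "Eylül", "Kasım", "Aralık"]:
    print(name, "->", strip_combining_marks(name))
```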
Comment 8 Kunal Mehta (Legoktm) 2013-10-05 04:39:49 UTC
https://gerrit.wikimedia.org/r/#/c/84204/
Comment 9 Mpaa 2013-11-09 21:39:37 UTC
Fixed by above patch.
