Last modified: 2013-11-09 21:39:37 UTC
Originally from: http://sourceforge.net/p/pywikipediabot/bugs/1482/
Reported by: Anonymous user
Created on: 2012-06-30 17:50:30
Subject: archivebot.py doesn't support unicode month names
Original description: archivebot.py doesn't work well with languages such as Turkish, which has some month names containing non-ASCII (Unicode) characters. Namely: 2 Şubat, 4 Mayıs, 8 Ağustos, 9 Eylül, 11 Kasım, 12 Aralık
Pywikipedia [http] trunk/pywikipedia (r10432, 2012/06/30, 15:47:55)
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]
config-settings:
use_api = True
use_api_login = True
unicode test: ok
The command line I used was: archivebot.py -l turkish Archive/config
Could you give us a traceback or further information about that bug? The bot uses the month names from the MediaWiki messages, and I don't know what significance the locale setting has. Could you try running the bot without the --locale=tr setting?
Sure. There is no traceback error for me to provide though since the code does work, it just ignores some threads.

Run1: archivebot.py -l turkish Archive/config
Fetching template transclusions...
Getting references to [[Sablon:Archive/config]] via API...
Processing [[tr:Kullanici mesaj:??????]]
3 Threads found on [[tr:Kullanici mesaj:??????]]
Looking for: {{Archive/config}} in [[tr:Kullanici mesaj:??????]]
Processing 3 threads
There are only 0 Threads. Skipping

Run2: archivebot.py Archive/config
Fetching template transclusions...
Getting references to [[Sablon:Archive/config]] via API...
Processing [[tr:Kullanici mesaj:??????]]
3 Threads found on [[tr:Kullanici mesaj:??????]]
Looking for: {{Archive/config}} in [[tr:Kullanici mesaj:??????]]
Processing 3 threads
There are only 0 Threads. Skipping

Note the Turkish character ı is displayed as i in the CMD window (I run code using Windows). The ???? relate to my user talk page http://tr.wikipedia.org/wiki/Kullan%C4%B1c%C4%B1_mesaj:%E3%81%A8%E3%81%82%E3%82%8B%E7%99%BD%E3%81%84%E7%8C%AB but CMD cannot display unicode.
Oh, when I ran the bot initially without -l turkish, it ignored all threads. Since it has already archived 3 of the 6 initial threads, it still reports 0 Threads, as it cannot see the ones with the month name "Mayıs".
Looked into this a bit. I've managed to isolate the problem to ~line 237 where all the txt2timestamp functions are. It seems that all of them are raising ValueErrors.
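For readers following along, here is a minimal sketch of why a ValueError in timestamp parsing would make threads look undated and therefore get skipped. It is not the archivebot.py code: parse_signature_timestamp and the format string are hypothetical, and it assumes the process runs with an English or C locale, so that %B only matches English month names.

```python
import time

def parse_signature_timestamp(text, fmt='%H:%M, %d %B %Y'):
    """Hypothetical helper: return a time tuple, or None if text does not match fmt."""
    try:
        return time.strptime(text, fmt)
    except ValueError:
        # A month name that strptime does not recognise ends up here,
        # so the caller sees "no timestamp" rather than a crash.
        return None

# Assumes an English/C locale for %B.
english = parse_signature_timestamp('17:50, 30 June 2012')
turkish = parse_signature_timestamp('17:50, 30 Mayıs 2012')
print(english is not None, turkish is not None)
```

If every candidate format fails this way, a thread has no usable date and would never be considered old enough to archive, which would match the "There are only 0 Threads" output above.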
Tried this:
import unicodedata
@line 237:
_TM = ''.join((c for c in unicodedata.normalize('NFD', TM.group(0)) if unicodedata.category(c) != 'Mn'))
and then call txt2timestamp with _TM instead of TM.group(0)
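The same normalization can be tried in isolation. This is a standalone sketch of the workaround above, not the patch that was eventually merged; strip_combining_marks is a name made up here. Note that the dotless ı (U+0131) has no NFD decomposition, so names such as Mayıs pass through unchanged, and the approach only helps if the month names being compared against are stripped the same way.

```python
import unicodedata

def strip_combining_marks(text):
    """Decompose text to NFD and drop combining marks (Unicode category 'Mn')."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

# Turkish month names listed in the original report.
turkish_months = [u'Şubat', u'Mayıs', u'Ağustos', u'Eylül', u'Kasım', u'Aralık']

for name in turkish_months:
    # Şubat -> Subat, Ağustos -> Agustos, Eylül -> Eylul;
    # Mayıs, Kasım and Aralık keep their dotless ı.
    print(name, '->', strip_combining_marks(name))
```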
https://gerrit.wikimedia.org/r/#/c/84204/
Fixed by above patch.