Last modified: 2014-08-07 14:03:29 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T70724, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 68724 - GWToolset should assume non unicode characters are windows-1252 not iso 8859-1
Status: RESOLVED INVALID
Product: MediaWiki extensions
Classification: Unclassified
Component: GWToolset
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-07-28 10:11 UTC by Jean-Fred
Modified: 2014-08-07 14:03 UTC (History)

Description Jean-Fred 2014-07-28 10:11:00 UTC
Not sure if this is really GWToolset's fault, but in any case: I had a file with (apparently) invisible characters, most likely caused by a bad encoding on the GLAM side of the character “œ” in the word “chœur”.

For information, here was the original CSV line:
APMH00004270;MH0004270;Ile-de-France;93;Saint-Denis;93066;Basilique Saint-Denis;;Stalles du chur;;Mieusement, Médéric (photographe);;;;;Négatif;PA00079952;Ministère de la Culture (France) - Médiathèque de l'architecture et du patrimoine - diffusion RMN;http://www.culture.gouv.fr/Wave/image/memoire/0403/sap01_mh004270_v.jpg;http://data.iledefrance.fr/api/datasets/1.0/photographies-serie-monuments-historiques-1851-a-1914/images/c5745a81dcc784be3affbd50dcd5c526/;48.9354612, 2.3598354;http://www.culture.gouv.fr/Wave/image/memoire/0403/sap01_mh004270_p.jpg

And the XML:
    <commons_title>Basilique_Saint-Denis_-_Stalles_du_chur_-_Saint-Denis_-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00004270.jpg</commons_title>

GWToolset uploaded the file with an invisible character in its title: “File:Basilique Saint-Denis - Stalles du chur - Saint-Denis - Médiathèque de l'architecture et du patrimoine - APMH00004270.jpg.jpg”

The character «  » is not displayed by MediaWiki (at least not in my browser/encoding/etc.)

<https://commons.wikimedia.org/w/index.php?title=File:Basilique_Saint-Denis_-_Stalles_du_ch%C2%9Cur_-_Saint-Denis_-_M%C3%A9diath%C3%A8que_de_l%27architecture_et_du_patrimoine_-_APMH00004270.jpg.jpg&redirect=no>

Maybe GWToolset should intercept that?
Comment 1 Bawolff (Brian Wolff) 2014-07-28 19:11:07 UTC
Ok. What happened is that the data was originally in a character set called windows-1252. In that character set "œ" is encoded as 0x9C. Somewhere along the line, it got converted to UTF-8, but during the conversion process it was assumed that the original data was in a character set called iso-8859-1. That character set uses 0x9C to mean "STRING TERMINATOR", which is an invisible control character.

So the end result is the image had a title in MW with 0xC2 0x9C which is the UTF-8 code for "STRING TERMINATOR", instead of 0xC5 0x93 which is the UTF-8 code for LATIN SMALL LIGATURE OE.
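
The difference between the two readings of byte 0x9C can be reproduced with a short Python sketch. It assumes Python's built-in "cp1252" and "latin-1" codecs as stand-ins for windows-1252 and iso-8859-1; the variable names are illustrative only.

```python
# Byte 0x9C as it would appear in windows-1252 encoded data.
raw = b"\x9c"

# Intended reading: windows-1252 maps 0x9C to U+0153 (LATIN SMALL LIGATURE OE).
as_cp1252 = raw.decode("cp1252")

# Mistaken reading: iso-8859-1 maps 0x9C to U+009C (a C1 control character).
as_latin1 = raw.decode("latin-1")

# Show the code point and the resulting UTF-8 bytes for each reading.
print(hex(ord(as_cp1252)), as_cp1252.encode("utf-8").hex())  # 0x153 c593
print(hex(ord(as_latin1)), as_latin1.encode("utf-8").hex())  # 0x9c c29c
```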

-----

It's hard to tell at what step the error occurred. If the conversion error happened in the csv->xml transformation, then it's not GWToolset's fault. If the error occurred in the xml->upload step, then it would be. Could you maybe upload the relevant csv and xml files as attachments? (Copying and pasting into Bugzilla comments messes with the encoding.)
Comment 2 Bawolff (Brian Wolff) 2014-07-28 19:13:59 UTC
p.s. FWIW, C0 and C1 control characters including "STRING TERMINATOR" are valid title characters (Although on commons they are blacklisted via title blacklist)
Comment 3 Bawolff (Brian Wolff) 2014-07-28 19:28:24 UTC
Actually I strongly suspect that it would be an issue in the csv->xml conversion and not gwtoolset, since having a raw 0x9C in the xml file would make the xml file invalid.
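
As a rough check of that reasoning, the sketch below feeds two variants of a title element to Python's standard XML parser: one with a raw 0x9C byte, and one with the misconverted UTF-8 sequence 0xC2 0x9C. The shortened title and the helper name `parses` are made up for illustration, and the parser is assumed to treat undeclared input as UTF-8.

```python
import xml.etree.ElementTree as ET


def parses(xml_bytes):
    """Return True if the bytes parse as XML, False on a parse error."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# A raw windows-1252 byte inside a document that is read as UTF-8.
raw_byte = b"<commons_title>Stalles du ch\x9cur</commons_title>"

# The same title after the wrong iso-8859-1 -> UTF-8 conversion (0xC2 0x9C).
misconverted = "<commons_title>Stalles du ch\u009cur</commons_title>".encode("utf-8")

print(parses(raw_byte))      # the lone 0x9C is not valid UTF-8
print(parses(misconverted))  # U+009C is encoded correctly, so parsing succeeds
```

So a file that reached the upload step as parseable XML would more plausibly contain the already-misconverted character than a raw 0x9C byte.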
Comment 4 Jean-Fred 2014-07-28 21:24:21 UTC
(In reply to Bawolff (Brian Wolff) from comment #3)
> Actually I strongly suspect that it would be an issue in the csv->xml
> conversion and not gwtoolset, since having a raw 0x9C in the xml file would
> make the xml file invalid.

Indeed, I loaded the CSV as UTF-8 (as I always do), using Python codecs.open(csv_file, 'r', 'utf-8'). Never suspected the CSV might be in windows-1252.

If "STRING TERMINATOR" is valid then I suppose all is fine. :) Marking as INVALID.
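
One defensive option on the conversion side, sketched here under assumptions rather than taken from the actual script: decode the CSV bytes strictly as UTF-8 and fall back to windows-1252 only if that fails. The function name `decode_csv_bytes` is hypothetical. Note that this fallback cannot help if the file already contains a misconverted 0xC2 0x9C sequence, since that is valid UTF-8.

```python
def decode_csv_bytes(data):
    """Decode CSV bytes as UTF-8, falling back to windows-1252.

    Python's "cp1252" codec leaves a few bytes undefined (for example 0x81),
    so the fallback can itself raise UnicodeDecodeError on unusual input.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")


# The same word arrives intact from either encoding.
print(decode_csv_bytes("chœur".encode("utf-8")))
print(decode_csv_bytes("chœur".encode("cp1252")))
```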
Comment 5 Jean-Fred 2014-07-28 21:26:10 UTC
(In reply to Bawolff (Brian Wolff) from comment #2)
> p.s. FWIW, C0 and C1 control characters including "STRING TERMINATOR" are
> valid title characters (Although on commons they are blacklisted via title
> blacklist)

You mean GWToolset ignores the title blacklist? That sounds bad.

(I noticed this error because the bot I fired to rename all the images of this batch choked on these 10 files with "STRING TERMINATOR" with an APIError. Not sure if the fault lies with Pywikibot, the MediaWiki API or something else, but such file titles are definitely a problem.)
Comment 6 Bawolff (Brian Wolff) 2014-07-28 21:36:56 UTC
(In reply to Jean-Fred from comment #5)
> (In reply to Bawolff (Brian Wolff) from comment #2)
> > p.s. FWIW, C0 and C1 control characters including "STRING TERMINATOR" are
> > valid title characters (Although on commons they are blacklisted via title
> > blacklist)
> 
> You mean GWToolset ignores the title blacklist? That sounds bad.
> 
> (I noticed this error because the bot I fired to rename all the images of
> this batch choked on these 10 files with "STRING TERMINATOR" with an
> APIError. Not sure if the fault lies with Pywikibot, the MediaWiki API or
> something else, but such file titles are definitely a problem.

Yes. The 0x9C should be blocked by the
 .*\p{Cc}.* <casesensitive|errmsg=titleblacklist-custom-hidden-char> # Control characters

rule. While such characters may technically be valid title characters according to MediaWiki, there is really no good reason to ever use them. Almost to the point where one might want to assume that things were converted wrong and automatically try to re-convert as if it's windows-1252.
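
A minimal Python sketch of both ideas, not GWToolset's actual behaviour: the first helper mirrors the `\p{Cc}` check using Unicode general categories, and the second re-reads any C1 control character (U+0080 to U+009F) as the windows-1252 byte with the same value. Both function names are hypothetical.

```python
import unicodedata


def has_control_chars(title):
    """True if the title contains any character in Unicode category Cc."""
    return any(unicodedata.category(ch) == "Cc" for ch in title)


def reinterpret_c1_as_cp1252(text):
    """Replace each C1 control with its windows-1252 reading, where defined."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # byte is undefined in windows-1252; keep the character
        out.append(ch)
    return "".join(out)


bad = "Stalles du ch\u009cur"
fixed = reinterpret_c1_as_cp1252(bad)
print(has_control_chars(bad), has_control_chars(fixed), fixed)
```

Working character by character leaves correctly decoded text such as "é" untouched, which a whole-string re-encode would not guarantee.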


[As an off-topic aside, Commons also blocks all astral characters (mostly dead languages and emoticons, but also a bunch of Chinese-Japanese-Korean characters), which seems a tad restrictive for a multilingual project of the scope that Commons is...]
Comment 7 Bawolff (Brian Wolff) 2014-08-07 14:03:29 UTC
I kind of changed my mind about this. See bug 69236
