Last modified: 2013-11-12 12:17:17 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T56289, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 54289 - CsvResultPrinter needs UTF-8 byte order mark in order for Excel to properly recognize UTF-8 encoding
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: Semantic MediaWiki
Version: master
Hardware: All All
Importance: Unprioritized enhancement with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2013-09-18 20:07 UTC by Chris Davis
Modified: 2013-11-12 12:17 UTC (History)


Attachments
Patch showing how to add the UTF-8 byte order mark to the output of the CsvResultPrinter (792 bytes, patch)
2013-09-18 20:07 UTC, Chris Davis

Description Chris Davis 2013-09-18 20:07:10 UTC
Created attachment 13314
Patch showing how to add the UTF-8 byte order mark to the output of the CsvResultPrinter

The problem can be reproduced with an inline query whose results contain non-Latin characters. If you set the output format to csv with a semicolon delimiter and click the link for the results, Excel opens the file directly. Although the file is encoded as UTF-8, Excel does not detect that. Reimporting the file into Excel with the correct encoding is a non-intuitive, multi-step process and a bit of a hassle.

Prepending the UTF-8 byte order mark (the bytes EF BB BF; see http://roosmaa.net/importing-utf-8-csvs-in-excel/) to the output of the CsvResultPrinter makes everything work like it should.
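
For illustration only, here is a minimal sketch of the idea in Python. It is not the attached patch, which targets the PHP CsvResultPrinter and is not reproduced here. The function name csv_with_bom and the sample rows are hypothetical; the only point is that the three BOM bytes come before the encoded CSV text.

```python
import codecs
import csv
import io


def csv_with_bom(rows, delimiter=";"):
    """Serialize rows to UTF-8 CSV bytes, preceded by the UTF-8 BOM.

    Hypothetical helper for illustration; not part of Semantic MediaWiki.
    """
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=delimiter)
    writer.writerows(rows)
    # codecs.BOM_UTF8 is b"\xef\xbb\xbf"; it has to be the very first
    # bytes of the output for a consumer to treat it as an encoding marker.
    return codecs.BOM_UTF8 + buffer.getvalue().encode("utf-8")


# Sample data with non-Latin-1 characters (made up for this example).
data = csv_with_bom([["Name", "City"], ["Łukasz", "Kraków"]])
```

Everything after the first three bytes is the unchanged UTF-8 CSV, so a consumer that understands the marker sees the same content as before.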
Comment 1 Gerrit Notification Bot 2013-09-18 23:38:19 UTC
Change 84907 had a related patch set uploaded by Mwjames:
(Bug 54289) \SMW\CsvResultPrinter UTF-8 byte order mark

https://gerrit.wikimedia.org/r/84907
Comment 2 MWJames 2013-09-18 23:47:50 UTC
I just added the patch and a test, but would the UTF-8 byte order mark cause recognition issues for non-Excel CSV interpreters?
Comment 3 Chris Davis 2013-09-19 07:36:14 UTC
After verifying that it works for Excel 14.0 (Office 2010) under Windows, I've also tried it on Ubuntu 13.04 with Gnumeric Spreadsheet 1.12.1 and LibreOffice Calc 4.0.2.2 without any problems.

So while things work for these programs, the more I read on the UTF-8 BOM, the more it seems like a hack to make the output work with Microsoft programs (https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 "Despite this, Microsoft...").  Apparently this can cause problems with programs that don't look for a BOM and try to parse it as data.  I'm not sure which programs would fall into this category, but I can test them if anyone has ideas.
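
To make the concern concrete, here is a small Python sketch of what a consumer that does not look for a BOM would see. It only demonstrates Python's own decoders, not any particular spreadsheet program; the sample bytes and the helper read_header are made up for the example.

```python
import codecs
import csv
import io

# UTF-8 CSV bytes that start with a BOM (sample data, made up).
raw = codecs.BOM_UTF8 + "Name;City\r\nAnna;Köln\r\n".encode("utf-8")


def read_header(data, encoding):
    """Decode the bytes with the given codec and return the first CSV row."""
    text = data.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
    return next(reader)


# Plain "utf-8" does not strip the BOM, so it ends up as U+FEFF
# at the start of the first field and is parsed as data.
naive_header = read_header(raw, "utf-8")

# "utf-8-sig" looks for the BOM and removes it before parsing.
aware_header = read_header(raw, "utf-8-sig")
```

In the first case the first column name is no longer equal to "Name", which is the kind of breakage a BOM-unaware program could run into, for example when looking up columns by header.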
Comment 4 Gerrit Notification Bot 2013-09-20 22:52:41 UTC
Change 84907 abandoned by Mwjames:
(Bug 54289) \SMW\CsvResultPrinter UTF-8 byte order mark

Reason:
Abandon this for now; it needs a clear analysis so it doesn't cause more issues than it would solve.

https://gerrit.wikimedia.org/r/84907
Comment 5 [[kgh]] 2013-11-12 09:54:22 UTC
The "excel" result format [1] may be a way out here, but I strongly believe that the csv result format should cater for Excel, too. This would definitely make life easier.

[1] http://semantic-mediawiki.org/wiki/Help:Excel_format
Comment 6 Chris Davis 2013-11-12 12:17:17 UTC
The more I've looked into this, the less convinced I am that CSV is a good format to use with Excel. Beyond the BOM issue, I've found that the expected delimiter differs depending on whether you're in, say, Europe or the US. In much of Europe, the comma is used as the decimal separator, so Excel expects a semicolon delimiter.
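
A short Python sketch of why the delimiter matters: the same line splits into different fields depending on which delimiter the reader assumes. This uses Python's generic csv module, not Excel's locale handling, and the sample line and the helper parse are made up for the example.

```python
import csv
import io

# One semicolon-delimited row whose numbers use decimal commas (sample data).
line = "Widget;1,5;2,25\r\n"


def parse(text, delimiter):
    """Return the first row of text, split on the given delimiter."""
    return next(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


# Reader that expects semicolons: three fields, numbers intact.
as_semicolon = parse(line, ";")

# Reader that expects commas: the decimal commas are taken as separators.
as_comma = parse(line, ",")
```

Both results have three fields here, so a wrong delimiter guess does not necessarily raise an error; it can silently produce wrong values.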
