Last modified: 2013-11-12 12:17:17 UTC
Created attachment 13314
Patch showing how to add the UTF-8 byte order mark to the output of the CsvResultPrinter

The problem can be reproduced with an inline query whose results contain non-Latin characters. If you set the output format to csv with a semicolon as the delimiter and then click the link for the results, Excel opens the file directly. Although the file is encoded as UTF-8, Excel does not detect that and displays the non-Latin characters incorrectly. Reimporting the file into Excel with the correct encoding is a non-intuitive, multi-step process that is a bit of a hassle. Prepending the UTF-8 byte order mark (see http://roosmaa.net/importing-utf-8-csvs-in-excel/) to the output of the CsvResultPrinter makes everything work as it should.
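For readers without access to the attachment, the idea can be sketched roughly as follows. This is an illustration in Python, not the attached patch (CsvResultPrinter itself is PHP); the helper name `rows_to_csv_bytes` and the sample rows are made up for the example.

```python
import csv
import io

# The UTF-8 encoding of U+FEFF, i.e. the byte order mark.
UTF8_BOM = b"\xef\xbb\xbf"


def rows_to_csv_bytes(rows, delimiter=";", with_bom=True):
    """Serialize rows to UTF-8 CSV bytes, optionally prefixed with a BOM.

    Hypothetical helper for illustration only.
    """
    # newline="" keeps the "\r\n" line endings written by the csv module.
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=delimiter)
    writer.writerows(rows)
    payload = buffer.getvalue().encode("utf-8")
    # The BOM must be the very first bytes of the output.
    return UTF8_BOM + payload if with_bom else payload


data = rows_to_csv_bytes([["Name", "City"], ["Zoë", "Київ"]])
```

The only difference between the two variants is the three leading bytes; the CSV payload itself is unchanged.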
Change 84907 had a related patch set uploaded by Mwjames: (Bug 54289) \SMW\CsvResultPrinter UTF-8 byte order mark https://gerrit.wikimedia.org/r/84907
Just added the patch + test, but would this UTF-8 byte order mark create recognition issues for non-Excel CSV interpreters?
After verifying that it works for Excel 14.0 (Office 2010) under Windows, I've also tried it on Ubuntu 13.04 with Gnumeric Spreadsheet 1.12.1 and LibreOffice Calc 4.0.2.2 without any problems. So while things work for these programs, the more I read on the UTF-8 BOM, the more it seems like a hack to make the output work with Microsoft programs (https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 "Despite this, Microsoft..."). Apparently this can cause problems with programs that don't look for a BOM and try to parse it as data. I'm not sure which programs would fall into this category, but I can test them if anyone has ideas.
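To make the "parse it as data" concern concrete, here is a small sketch of what a consumer that ignores the BOM would see. It uses Python's standard `utf-8` and `utf-8-sig` codecs as stand-ins for a BOM-unaware and a BOM-aware reader; the sample bytes and the `read_header` helper are assumptions for the example, not output from SMW.

```python
import csv
import io

# Sample CSV bytes with a leading UTF-8 BOM (made-up data).
raw = b"\xef\xbb\xbfName;City\r\nZo\xc3\xab;Kyiv\r\n"


def read_header(data, encoding):
    """Decode the bytes with the given codec and return the first CSV row."""
    text = data.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
    return next(reader)


# Plain "utf-8" keeps the BOM as a U+FEFF character in the first field.
naive_header = read_header(raw, "utf-8")

# "utf-8-sig" strips a leading BOM if one is present.
aware_header = read_header(raw, "utf-8-sig")
```

In the BOM-unaware case the first column name becomes `"\ufeffName"`, so a lookup by the name `"Name"` would fail. Any program that behaves like the first reader would be a candidate for the testing mentioned above.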
Change 84907 abandoned by Mwjames: (Bug 54289) \SMW\CsvResultPrinter UTF-8 byte order mark Reason: Abandon this for now, needs a clear analysis so it doesn't cause more issues than it would solve. https://gerrit.wikimedia.org/r/84907
The "excel" result format [1] may be a way out here, but I strongly believe that the csv result format should cater for Excel, too. This would definitely make life easier. [1] http://semantic-mediawiki.org/wiki/Help:Excel_format
The more I've looked into this, the less convinced I am that CSV is a good format to use with Excel. Beyond the BOM issue, I've found that the delimiter Excel expects depends on regional settings, for example Europe versus the US. In much of Europe the comma is used as the decimal separator, so Excel expects a semicolon as the field delimiter.
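A rough sketch of the delimiter problem, assuming the simple rule described above (semicolon where the comma is the decimal separator, comma otherwise). The helper names are hypothetical, and how Excel actually picks its list separator depends on the user's system settings, which this example does not model.

```python
import csv
import io


def delimiter_for(decimal_point):
    """Hypothetical rule: use ';' where ',' is the decimal separator."""
    return ";" if decimal_point == "," else ","


def write_row(values, decimal_point):
    """Write one CSV row using the delimiter chosen for that convention."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=delimiter_for(decimal_point))
    writer.writerow(values)
    return buffer.getvalue()


# US-style convention: '.' as decimal separator, ',' as field delimiter.
us_line = write_row(["pi", "3.14"], ".")

# Convention common in Europe: ',' as decimal separator, ';' as delimiter.
eu_line = write_row(["pi", "3,14"], ",")
```

The same data therefore needs a different file depending on who opens it, which is the core of the objection: a single CSV download cannot match every reader's regional settings.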