Last modified: 2014-10-16 12:08:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T55398, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 53398 - Zimlib is not able to deal correctly with unicode path on Windows
Zimlib is not able to deal correctly with unicode path on Windows
Status: NEW
Product: openZIM
Classification: Unclassified
zimlib (Other open bugs)
unspecified
All All
: Lowest major
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-08-27 11:13 UTC by Kelson [Emmanuel Engelhart]
Modified: 2014-10-16 12:08 UTC (History)
2 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Kelson [Emmanuel Engelhart] 2013-08-27 11:13:27 UTC
Windows uses utf16 to encode unicode paths. The zimlib uses char* to store the paths. For this reason, if you are unlucky (so use paths with characters over the first 8 bits) you are not able to open your ZIM file using the zimlib. We should provide a way to open paths in wchar_t* format, at least on Windows.
Comment 1 Brion Vibber 2013-09-03 19:21:56 UTC
Can this be done with the standard POSIX file APIs? Or do we need to use Win32 APIs for file access? :P

(Use of legacy 8-bit encodings in POSIX APIs is a problem for PHP, and thus MediaWiki, when running on Windows servers as well.)

Sounds like you gotta use the Win32 wide-char APIs to get Unicode filename support... at least that's what glib does on Windows for this reason -- https://developer.gnome.org/glib/2.26/glib-File-Utilities.html#glib-File-Utilities.description
Comment 2 Lode 2014-06-13 14:29:53 UTC
Hello,

I have done a little bit of research on this (I must admit it was on a Linux computer however).

I hope this may help at least with fixing this issue.

Most File, FileImpl, zim::ifstream and zim::streambuf works with string. There is one intermediate string -> const char* -> string conversion in FileImpl, which the attached patch removes (to use string for everything). This to hopefully support the NUL-characters that may appear in the octets of UTF-16 strings put in a 8-bit char string.

In addition, the patch has a comment at the locations in fstream.cpp where probably different Windows API needs to be called.

I think that this way, file.h does not need to be changed to have wchar_t or wstring constructors.

What follows below the line is the patch:

------------------

diff --git a/zimlib/include/zim/file.h b/zimlib/include/zim/file.h
index a6ac75b..8aa5f0d 100644
--- a/zimlib/include/zim/file.h
+++ b/zimlib/include/zim/file.h
@@ -39,7 +39,7 @@ namespace zim
       File()
         { }
       explicit File(const std::string& fname)
-        : impl(new FileImpl(fname.c_str()))
+        : impl(new FileImpl(fname))
         { }
 
       const std::string& getFilename() const   { return impl->getFilename(); }
diff --git a/zimlib/include/zim/fileimpl.h b/zimlib/include/zim/fileimpl.h
index 1cf584d..ada65ee 100644
--- a/zimlib/include/zim/fileimpl.h
+++ b/zimlib/include/zim/fileimpl.h
@@ -53,7 +53,7 @@ namespace zim
       offset_type getOffset(offset_type ptrOffset, size_type idx);
 
     public:
-      explicit FileImpl(const char* fname);
+      explicit FileImpl(const std::string& fname);
 
       time_t getMTime() const   { return zimFile.getMTime(); }
 
diff --git a/zimlib/src/fileimpl.cpp b/zimlib/src/fileimpl.cpp
index 8c072eb..d7a7f4a 100644
--- a/zimlib/src/fileimpl.cpp
+++ b/zimlib/src/fileimpl.cpp
@@ -38,7 +38,7 @@ namespace zim
   //////////////////////////////////////////////////////////////////////
   // FileImpl
   //
-  FileImpl::FileImpl(const char* fname)
+  FileImpl::FileImpl(const std::string& fname)
     : zimFile(fname),
       direntCache(envValue("ZIM_DIRENTCACHE", DIRENT_CACHE_SIZE)),
       clusterCache(envValue("ZIM_CLUSTERCACHE", CLUSTER_CACHE_SIZE))
diff --git a/zimlib/src/fstream.cpp b/zimlib/src/fstream.cpp
index 5ce72f5..e4accf8 100644
--- a/zimlib/src/fstream.cpp
+++ b/zimlib/src/fstream.cpp
@@ -59,6 +59,24 @@ namespace zim
 //
 streambuf::OpenfileInfo::OpenfileInfo(const std::string& fname_)
   : fname(fname_),
+// I think regular std::string with 8-bit characters
+// is OK up to this point, that is, in the ctor and fields of File,
+// FileImpl, ifstream, streambuf etc... (that is, no wchar_t or wstring
+// necessary there and no API change needed), the string just may have
+// NULL-characters if bytes of UTF-16 are present, but std::string
+// supports that (const char* does not).
+// So c_str() or any C/C++ API that uses C strings should not be used at
+// this point or anywhere before.
+// So here a Windows-specific API that supports UTF-16 is needed for
+// _WIN32, and the fname_ string needs to converted here to what it
+// needs if necessary. That is still TODO, hence the comment here.
+// Apparently the fname received here on Windows is UTF-16 encoded with
+// two chars per wchar_t, if used by Kiwix (the other possibility would
+// have been that it was UTF-8 encoded and needed to be converted to
+// UTF-16). Maybe a comment in the File constructor could also be nice
+// to ensure all users will be consistent with this?
+// This concludes my investigation, I hope it can help fixing the
+// issue :)
 #ifdef HAVE_OPEN64
     fd(::open64(fname.c_str(), O_RDONLY | O_LARGEFILE | O_BINARY))
 #else
@@ -293,6 +311,8 @@ time_t streambuf::getMTime() const
   if (mtime || files.empty())
     return mtime;
 
+// See comment under streambuf::OpenfileInfo::OpenfileInfo about UTF-16:
+// the same applies here as well to the c_str() call and stat.
   const char* fname = files.front()->fname.c_str();
 
 #ifdef HAVE_STAT64

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links