Last modified: 2014-10-16 12:08:18 UTC
Windows uses utf16 to encode unicode paths. The zimlib uses char* to store the paths. For this reason, if you are unlucky (so use paths with characters over the first 8 bits) you are not able to open your ZIM file using the zimlib. We should provide a way to open paths in wchar_t* format, at least on Windows.
Can this be done with the standard POSIX file APIs? Or do we need to use Win32 APIs for file access? :P (Use of legacy 8-bit encodings in POSIX APIs is a problem for PHP, and thus MediaWiki, when running on Windows servers as well.) Sounds like you gotta use the Win32 wide-char APIs to get Unicode filename support... at least that's what glib does on Windows for this reason -- https://developer.gnome.org/glib/2.26/glib-File-Utilities.html#glib-File-Utilities.description
Hello, I have done a little bit of research on this (I must admit it was on a Linux computer however). I hope this may help at least with fixing this issue. Most File, FileImpl, zim::ifstream and zim::streambuf works with string. There is one intermediate string -> const char* -> string conversion in FileImpl, which the attached patch removes (to use string for everything). This to hopefully support the NUL-characters that may appear in the octets of UTF-16 strings put in a 8-bit char string. In addition, the patch has a comment at the locations in fstream.cpp where probably different Windows API needs to be called. I think that this way, file.h does not need to be changed to have wchar_t or wstring constructors. What follows below the line is the patch: ------------------ diff --git a/zimlib/include/zim/file.h b/zimlib/include/zim/file.h index a6ac75b..8aa5f0d 100644 --- a/zimlib/include/zim/file.h +++ b/zimlib/include/zim/file.h @@ -39,7 +39,7 @@ namespace zim File() { } explicit File(const std::string& fname) - : impl(new FileImpl(fname.c_str())) + : impl(new FileImpl(fname)) { } const std::string& getFilename() const { return impl->getFilename(); } diff --git a/zimlib/include/zim/fileimpl.h b/zimlib/include/zim/fileimpl.h index 1cf584d..ada65ee 100644 --- a/zimlib/include/zim/fileimpl.h +++ b/zimlib/include/zim/fileimpl.h @@ -53,7 +53,7 @@ namespace zim offset_type getOffset(offset_type ptrOffset, size_type idx); public: - explicit FileImpl(const char* fname); + explicit FileImpl(const std::string& fname); time_t getMTime() const { return zimFile.getMTime(); } diff --git a/zimlib/src/fileimpl.cpp b/zimlib/src/fileimpl.cpp index 8c072eb..d7a7f4a 100644 --- a/zimlib/src/fileimpl.cpp +++ b/zimlib/src/fileimpl.cpp @@ -38,7 +38,7 @@ namespace zim ////////////////////////////////////////////////////////////////////// // FileImpl // - FileImpl::FileImpl(const char* fname) + FileImpl::FileImpl(const std::string& fname) : zimFile(fname), direntCache(envValue("ZIM_DIRENTCACHE", DIRENT_CACHE_SIZE)), clusterCache(envValue("ZIM_CLUSTERCACHE", CLUSTER_CACHE_SIZE)) diff --git a/zimlib/src/fstream.cpp b/zimlib/src/fstream.cpp index 5ce72f5..e4accf8 100644 --- a/zimlib/src/fstream.cpp +++ b/zimlib/src/fstream.cpp @@ -59,6 +59,24 @@ namespace zim // streambuf::OpenfileInfo::OpenfileInfo(const std::string& fname_) : fname(fname_), +// I think regular std::string with 8-bit characters +// is OK up to this point, that is, in the ctor and fields of File, +// FileImpl, ifstream, streambuf etc... (that is, no wchar_t or wstring +// necessary there and no API change needed), the string just may have +// NULL-characters if bytes of UTF-16 are present, but std::string +// supports that (const char* does not). +// So c_str() or any C/C++ API that uses C strings should not be used at +// this point or anywhere before. +// So here a Windows-specific API that supports UTF-16 is needed for +// _WIN32, and the fname_ string needs to converted here to what it +// needs if necessary. That is still TODO, hence the comment here. +// Apparently the fname received here on Windows is UTF-16 encoded with +// two chars per wchar_t, if used by Kiwix (the other possibility would +// have been that it was UTF-8 encoded and needed to be converted to +// UTF-16). Maybe a comment in the File constructor could also be nice +// to ensure all users will be consistent with this? +// This concludes my investigation, I hope it can help fixing the +// issue :) #ifdef HAVE_OPEN64 fd(::open64(fname.c_str(), O_RDONLY | O_LARGEFILE | O_BINARY)) #else @@ -293,6 +311,8 @@ time_t streambuf::getMTime() const if (mtime || files.empty()) return mtime; +// See comment under streambuf::OpenfileInfo::OpenfileInfo about UTF-16: +// the same applies here as well to the c_str() call and stat. const char* fname = files.front()->fname.c_str(); #ifdef HAVE_STAT64