Last modified: 2014-11-16 00:08:02 UTC
After the fix for Bug 33430, the Chinese Wikipedia community reports one more problem that keeps them from adopting the latest MediaWiki version that provides PDF/ebook creation for the Chinese Wikipedia. The wikitext of the Chinese Wikipedia is a mix of simplified and traditional Chinese (mainland contributors tend to edit in simplified Chinese, while contributors from Taiwan and Hong Kong tend to edit in traditional Chinese), so it needs to be converted to all-simplified or all-traditional before being displayed or rendered as a PDF.
The language converter is not used only on zhwiki.
Is the conversion to all-simplified or all-traditional done for "regular" display in the browser, and is it therefore only a problem with the PDFs at the moment? If that is the case:
* How is the conversion done for the browser?
* Can someone provide a minimal example with simplified and traditional Chinese?
* What would be a good starting point for understanding the issues of simplified vs. traditional Chinese and the conversion methods?
The Chinese Wikipedia itself already has an automatic simplified <-> traditional Chinese conversion tool for display. It is explained here:
http://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese
An example of the conversion in action:
Simplified: http://zh.wikipedia.org/zh-cn/%E4%BA%94%E4%BB%A3%E5%8D%81%E5%9B%BD
Traditional: http://zh.wikipedia.org/zh-tw/%E4%BA%94%E4%BB%A3%E5%8D%81%E5%9B%BD
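For anyone scripting against the two example links, a minimal Python sketch of how such variant URLs can be built. The function name `variant_url` and its `host` parameter are made up for illustration; the only thing taken from the links above is the `/<variant>/<percent-encoded title>` path shape.

```python
from urllib.parse import quote

def variant_url(variant, title, host="zh.wikipedia.org"):
    """Build a URL that asks the wiki to render `title` in one variant.

    `variant` is a path prefix such as "zh-cn" or "zh-tw", as in the
    example links; `host` is an assumption for this sketch.
    """
    # quote() percent-encodes the UTF-8 bytes of the title.
    return "http://{}/{}/{}".format(host, variant, quote(title))

print(variant_url("zh-cn", "五代十国"))
print(variant_url("zh-tw", "五代十国"))
```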
(In reply to comment #2)
> Is the conversion to all-simplified or all-traditional done for "regular"
> display in the browser - and therefore only a problem with the PDFs at the
> moment? If that is the case:
>
> * how is the conversion done for the browser
> * can someone provide a minimal example with simplified and traditional chinese
> * what would be a good start to read in order to understand the problematic of
> simplified vs. traditional chinese and conversion methods

Technically, the language conversion is done after the normal parsing process. This means that if you parse the article in your own way (to generate a PDF), you have to apply the conversion to your parser's output yourself. Note that the current converter (in languages/LanguageConverter.php) is designed only to convert HTML.
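To illustrate "convert after parsing", here is a rough Python sketch that applies a character table to text fragments a parser has already produced. It is not how LanguageConverter.php or mwlib work. The table holds four hand-picked characters, the variant codes "zh-hans" and "zh-hant" are used only as labels, and the names `convert_text` and `convert_fragments` are hypothetical. Real conversion is not one-to-one and also needs phrase-level tables.

```python
# Tiny, hand-picked traditional -> simplified table; purely illustrative.
TRAD_TO_SIMP = {"國": "国", "學": "学", "體": "体", "漢": "汉"}
# Naive inverse; real simplified -> traditional mapping is one-to-many.
SIMP_TO_TRAD = {simp: trad for trad, simp in TRAD_TO_SIMP.items()}

def convert_text(text, variant):
    """Convert one text fragment to the requested variant."""
    table = TRAD_TO_SIMP if variant == "zh-hans" else SIMP_TO_TRAD
    return text.translate(str.maketrans(table))

def convert_fragments(fragments, variant):
    """Apply the conversion to every text fragment of a parsed article."""
    return [convert_text(fragment, variant) for fragment in fragments]

# Fragments as a (hypothetical) parser might hand them over.
print(convert_fragments(["五代十國", "漢字"], "zh-hans"))
```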
I'm sure there are many PHP-based simplified/traditional Chinese conversion libraries.
(In reply to comment #5)
> I'm sure there are many PHP-based simplified/traditional Chinese conversion
> libraries.

mwlib (the wikitext parser & PDF generator used by Extension:Collection) is not written in PHP. Besides, you have to handle conversion markup such as -{}-.
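As a rough idea of what handling -{}- involves, here is a Python sketch under simplifying assumptions: a plain `-{text}-` span is left unconverted, and a span of the form `-{zh-hans:X; zh-hant:Y}-` selects the text for the requested variant. Flags and the other rule forms supported by the real converter are ignored, and a plain span that happens to contain a colon would be misread as a rule. All function names here are hypothetical.

```python
import re

# Non-greedy match of one -{ ... }- span (nesting is not handled).
MARKUP = re.compile(r"-\{(.*?)\}-", re.S)

def resolve_markup(body, variant):
    """Pick the text for `variant` from the body of one -{...}- span."""
    if ":" not in body:
        return body  # plain -{text}-: keep the text unconverted
    rules = {}
    for part in body.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            rules[key.strip()] = value.strip()
    # Fall back to the first rule if the variant is not listed.
    return rules.get(variant, next(iter(rules.values())))

def convert_with_markup(text, variant, convert):
    """Run `convert` on everything outside -{...}- spans."""
    out, pos = [], 0
    for match in MARKUP.finditer(text):
        out.append(convert(text[pos:match.start()]))
        out.append(resolve_markup(match.group(1), variant))
        pos = match.end()
    out.append(convert(text[pos:]))
    return "".join(out)
```

Here `convert` is any callable that converts plain text, so the markup handling stays separate from whichever conversion table or library is eventually used.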
The conversion script doesn't exactly look trivial:
http://svn.wikimedia.org/doc/LanguageConverter_8php_source.html
Does anybody have an idea how to get the conversion done without reimplementing the language converter in Python for mwlib?
Google for an existing Python-based conversion library?
Or just ask for patches?
Google Translate also offers simp. <-> trad. Chinese conversion. Maybe call its API?
(In reply to comment #10)
> Google Translate also offers simp. <-> trad. Chinese conversion. Maybe call its
> API?

Even in LanguageConverter.php, more code goes into things like parsing conversion markup, picking out the parts to convert, reading the on-site conversion table, and handling page links than into actually converting the text.
I increasingly believe that such features would be better implemented on the client side, e.g. as a "site to PDF ebook" program that converts a given site (blog, wiki, pages of a certain depth from a start page, etc.) to a PDF.
If you do it too far on the "back end", you end up with too much processing in the middle, like this Chinese conversion issue.
The problem with the "client-side" approach is that every client needs to re-implement these specific features (like the simplified/traditional conversion). If we ever use HTML as the base for PDF rendering, this problem will be solved as long as MediaWiki takes care of the transformation. In the meantime I'd happily accept a patch for the problem, but I lack the time to implement the simplified/traditional conversion myself.
(In reply to comment #14)
> The problem with the "client-side" approach is that every client needs to
> re-implement these specific features (like the simple/traditional conversion).

No, because simplified/traditional conversion is already taken care of by the Chinese Wikipedia on the server side.

> If we ever use HTML as the base for PDF rendering this problem will be solved
> as long as MediaWiki takes care of the transformation. In the meantime I'd
> happily accept a patch for the problem, but I lack the time to implement the
> simple/traditional conversion.

That's exactly why I think third-party client-side or browser-side PDF/ebook creation solutions would provide what PediaPress hasn't provided.
FYI, before LanguageConverter.php, there was a quick'n'dirty trial in LanguageZh.php: https://bugzilla.wikimedia.org/show_bug.cgi?id=5343
(In reply to Liangent from comment #6)
> Besides you have to consider conversion markups such as
> -{}-.

The test case provided by Nikola in
http://web.archive.org/web/20111002213849/http://code.pediapress.com/wiki/ticket/574
is still valid:
https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9A%D1%9A%D0%B8%D0%B3%D0%B0&bookcmd=render_article&arttitle=%D0%9A%D0%BE%D1%80%D0%B8%D1%81%D0%BD%D0%B8%D0%BA%3A%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0+%D0%A1%D0%BC%D0%BE%D0%BB%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%2FCollection+bugs&oldid=2610141&writer=rdf2latex
Created attachment 16595
Корисник:Никола Смоленски/Collection bugs.pdf

Serbian test case PDF as produced by [[mw:OCG]]/rdf2latex/new PDF rendering.