Last modified: 2014-11-16 00:08:02 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T36919, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 34919 - Language conversion is not applied in documents delivered by the Collection extension



Status:	NEW

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	Collection
Version:	unspecified
Hardware:	All All

Importance:	Normal major with 2 votes
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	i18n

Depends on:	41716
Blocks:

Reported:	2012-03-03 01:22 UTC by Ziyuan Yao
Modified:	2014-11-16 00:08 UTC
CC List:	14 users

See Also:	http://web.archive.org/web/20111002213849/http://code.pediapress.com/wiki/ticket/574
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Корисник:Никола Смоленски/Collection bugs.pdf (42.95 KB, application/pdf) 2014-09-25 20:28 UTC, Nemo

Description Ziyuan Yao 2012-03-03 01:22:43 UTC

After Bug 33430 was fixed, the Chinese Wikipedia community says there is still another problem that prevents them from adopting the latest MediaWiki version that provides PDF/ebook creation for the Chinese Wikipedia.

The remaining problem is that the wikitext of the Chinese Wikipedia is a mix of simplified and traditional Chinese (mainland contributors tend to edit in simplified Chinese, while contributors from Taiwan and Hong Kong tend to edit in traditional Chinese), so it needs to be converted to all-simplified or all-traditional before being displayed or made into PDFs.

Comment 1 Liangent 2012-03-03 04:33:52 UTC

The language converter is not used only on zhwiki.

Comment 2 Volker Haas 2012-03-05 08:20:18 UTC

Is the conversion to all-simplified or all-traditional done for "regular" display in the browser, so that it is only a problem with the PDFs at the moment? If that is the case:

* how is the conversion done for the browser?
* can someone provide a minimal example with simplified and traditional Chinese?
* what would be a good starting point to read in order to understand the problems of simplified vs. traditional Chinese and the conversion methods?

Comment 3 Ziyuan Yao 2012-03-05 08:40:29 UTC

The Chinese Wikipedia itself already has an automatic simplified <-> traditional Chinese conversion tool for display. It is explained here:

http://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese

An example of the conversion in action:

Simplified: http://zh.wikipedia.org/zh-cn/%E4%BA%94%E4%BB%A3%E5%8D%81%E5%9B%BD

Traditional: http://zh.wikipedia.org/zh-tw/%E4%BA%94%E4%BB%A3%E5%8D%81%E5%9B%BD
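As a small illustration of the two links above, the variant appears as the first path segment and the article title is percent-encoded UTF-8. The following Python sketch rebuilds both URLs; the function name `variant_url` and its parameters are made up for this example and are not part of MediaWiki.

```python
from urllib.parse import quote


def variant_url(title, variant, host="zh.wikipedia.org"):
    """Build a URL of the form http://<host>/<variant>/<percent-encoded title>.

    Hypothetical helper: it only mirrors the shape of the two example
    links above and is not an official MediaWiki interface.
    """
    return "http://{}/{}/{}".format(host, variant, quote(title))


if __name__ == "__main__":
    title = "五代十国"
    print(variant_url(title, "zh-cn"))  # simplified
    print(variant_url(title, "zh-tw"))  # traditional
```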

Comment 4 Liangent 2012-03-05 09:14:41 UTC

(In reply to comment #2)
> Is the conversion to all-simplified of all-traditional done for "regular"
> display in the browser - and therefore only a problem with the PDFs at the
> moment? If that is the case: 
> 
> * how is the conversion done for the browser
> * can someone provide a minimal example with simplified and traditional chinese
> * what would be a good start to read in order to understand the problematic of
> simplified vs. traditional chinese and conversion methods

Technically, language conversion is done after the normal parsing process. This means that if you parse the article in your own way (to generate a PDF), you have to apply the conversion to your parser's result yourself. Note that the current converter (in languages/LanguageConverter.php) is designed to convert HTML only.
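To make "conversion after parsing" concrete, here is a minimal Python sketch that converts only the text between tags of already-parsed HTML and leaves the tags themselves alone. It is not how LanguageConverter.php works internally: the three-character table `TOY_S2T` and the function `convert_html_text` are assumptions for illustration, and a real converter relies on much larger, phrase-level tables.

```python
import re

# Toy simplified -> traditional table (illustrative only; real tables are
# far larger and work on phrases, not just single characters).
TOY_S2T = str.maketrans({"国": "國", "汉": "漢", "语": "語"})

# Capturing group so that re.split keeps the tags in the result list.
TAG = re.compile(r"(<[^>]*>)")


def convert_html_text(html, table=TOY_S2T):
    """Convert text nodes of an HTML string, leaving tags untouched."""
    parts = TAG.split(html)
    return "".join(
        part if part.startswith("<") else part.translate(table)
        for part in parts
    )


if __name__ == "__main__":
    print(convert_html_text('<a title="国">汉语</a>'))
```

The same idea would apply to any other parser output: run the conversion as a final pass over the text nodes, whatever the tree looks like.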

Comment 5 Ziyuan Yao 2012-03-05 09:18:53 UTC

I'm sure there are many PHP-based simplified/traditional Chinese conversion libraries.

Comment 6 Liangent 2012-03-05 09:21:17 UTC

(In reply to comment #5)
> I'm sure there are many PHP-based simplified/traditional Chinese conversion
> libraries.

mwlib (the wikitext parser & PDF generator used by Extension:Collection) is not written in PHP. Besides, you have to consider conversion markup such as -{}-.
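For readers unfamiliar with the markup: in its simplest form, text wrapped in -{ }- is exempt from conversion. A rough Python sketch of honouring that, assuming no nesting and no flags inside the markup (both of which real pages can contain), could look like this. The names `convert_with_protection` and `TOY_S2T` are hypothetical.

```python
import re

# Toy simplified -> traditional table, for illustration only.
TOY_S2T = str.maketrans({"国": "國", "汉": "漢", "语": "語"})

# Non-greedy match of -{ ... }- ; nested markup is NOT handled here.
MARKUP = re.compile(r"-\{(.*?)\}-", re.S)


def convert_with_protection(text, convert):
    """Apply `convert` to everything except the contents of -{ }- spans."""
    out = []
    pos = 0
    for match in MARKUP.finditer(text):
        out.append(convert(text[pos:match.start()]))  # ordinary text
        out.append(match.group(1))                    # protected, kept as-is
        pos = match.end()
    out.append(convert(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    to_traditional = lambda s: s.translate(TOY_S2T)
    print(convert_with_protection("汉语 -{汉语}-", to_traditional))
```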

Comment 7 Volker Haas 2012-03-05 09:58:08 UTC

The conversion script doesn't exactly look trivial: http://svn.wikimedia.org/doc/LanguageConverter_8php_source.html

Does anybody have an idea how to get the conversion done without having to reimplement the language converter in Python for mwlib?

Comment 8 Ziyuan Yao 2012-03-05 10:02:17 UTC

Google for an existing Python-based conversion library?

Comment 9 Ralf Schmitt 2012-03-05 10:05:23 UTC

or just ask for patches?

Comment 10 Ziyuan Yao 2012-03-05 10:08:23 UTC

Google Translate also offers simp. <-> trad. Chinese conversion. Maybe call its API?

Comment 11 Liangent 2012-03-05 11:17:25 UTC

(In reply to comment #10)
> Google Translate also offers simp. <-> trad. Chinese conversion. Maybe call its
> API?

Even in LanguageConverter.php, more code is spent on things like parsing conversion markup, picking out the proper parts to convert, reading the on-site conversion tables and handling page links than on actually converting the text.
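One example of that non-conversion work is markup that lists explicit per-variant text, such as -{zh-hans:计算机; zh-hant:電腦}-. The Python sketch below parses the body of such a rule and picks the text for one variant. The function names are invented for this example, and the fallback between related variants (for instance zh-tw to zh-hant) is deliberately not modelled.

```python
def parse_rule(body):
    """Split 'variant:text; variant:text' into a {variant: text} dict."""
    rules = {}
    for item in body.split(";"):
        if ":" in item:
            variant, text = item.split(":", 1)
            rules[variant.strip()] = text.strip()
    return rules


def pick_variant(body, variant):
    """Return the text for `variant`, or the body unchanged if the rule
    does not mention that variant (no fallback chain in this sketch)."""
    return parse_rule(body).get(variant, body)


if __name__ == "__main__":
    rule = "zh-hans:计算机; zh-hant:電腦"
    print(pick_variant(rule, "zh-hans"))
    print(pick_variant(rule, "zh-hant"))
```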

Comment 12 Ziyuan Yao 2012-03-07 04:44:20 UTC

I increasingly believe that such features are better implemented on the client side, e.g. as a "site to PDF ebook" program that converts a given site (blog, wiki, pages up to a certain depth from a start page, etc.) to a PDF.

Comment 13 Ziyuan Yao 2012-03-07 04:45:23 UTC

If you do it too far on the back end, you have too much processing in the middle, like this Chinese conversion thing.

Comment 14 Volker Haas 2012-03-07 07:41:14 UTC

The problem with the "client-side" approach is that every client needs to re-implement these specific features (like the simplified/traditional conversion).

If we ever use HTML as the base for PDF rendering, this problem will be solved as long as MediaWiki takes care of the transformation. In the meantime I'd happily accept a patch for the problem, but I lack the time to implement the simplified/traditional conversion.

Comment 15 Ziyuan Yao 2012-03-07 07:52:07 UTC

(In reply to comment #14)
> The problem with the "client-side" approach is that every client needs to
> re-implement these specific features (like the simple/traditional conversion).

No, because simplified/traditional conversion is already taken care of by the Chinese Wikipedia on the server side.

> 
> If we ever use HTML as the base for PDF rendering this problem will be solved
> as long as MediaWiki takes care of the transformation. In the meantime I'd
> happily accept a patch for the problem, but I lack the time to implement the
> simple/traditional conversion.

That's exactly why I think third-party client-side or browser-side PDF/ebook creation solutions would provide what PediaPress hasn't provided.

Comment 16 Tian-Jian "Barabbas" Jiang 2012-09-08 02:29:51 UTC

FYI, before LanguageConverter.php, there's a quick'n'dirty trail of LanguageZh.php: https://bugzilla.wikimedia.org/show_bug.cgi?id=5343

Comment 17 Nemo 2014-09-25 20:26:38 UTC

(In reply to Liangent from comment #6)
> Besides you have to consider conversion markups such as
> -{}-.

The test case provided by Nikola in http://web.archive.org/web/20111002213849/http://code.pediapress.com/wiki/ticket/574 is still valid:
https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9A%D1%9A%D0%B8%D0%B3%D0%B0&bookcmd=render_article&arttitle=%D0%9A%D0%BE%D1%80%D0%B8%D1%81%D0%BD%D0%B8%D0%BA%3A%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0+%D0%A1%D0%BC%D0%BE%D0%BB%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%2FCollection+bugs&oldid=2610141&writer=rdf2latex

Comment 18 Nemo 2014-09-25 20:28:44 UTC

Created attachment 16595
Корисник:Никола Смоленски/Collection bugs.pdf

Serbian test case PDF as produced by [[mw:OCG]]/rdf2latex/new PDF rendering.
