Last modified: 2014-10-17 15:27:25 UTC
I'm not expecting this will happen soon; just leaving this bug here. The required features are described below. Some may belong to VisualEditor.

Phase 1: Encapsulate conversion syntax (-{}- markup) in non-editable blocks to avoid breakage.

Phase 2: Enable editing of these conversion blocks.

Phase 3: Convert all text in the DOM to the requested variant, and convert it back to the original variant when constructing wikitext. Don't change text to another variant if the user doesn't edit that word in the DOM. Shadowing the whole text may be needed here.
Oh, that will be fun ;)
It would be good to look into implementing phase 1 (recognize and protect language conversion content).
Liangent, can you please link us to documentation about how this works? Initial searches have been less than fruitful.
Next question: Should a construct like {{variantopen}}令{{variantclose}} work (assume it expands to -{令}-)? If not, would it be difficult to phase that construct out as deprecated and go forward with Parsoid not supporting it? Thanks for your help, we'd love to get Parsoid working with zh-wikis.
Some documentation:
* https://www.mediawiki.org/wiki/Language_converter
* https://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese
(In reply to comment #3)
> Liangent, can you please link us to documentation about how this works?
> Initial searches have been less than fruitful.

Do you mean how it's done in the PHP parsing process, or what is expected to be done (a specification of the related syntax)?
(In reply to comment #4) > Next question: Should a construct like {{variantopen}}令{{variantclose}} work > (assume it expands to -{令}-)? If not, would it be difficult to phase that > construct out as deprecated and go forward with Parsoid not supporting it? > > Thanks for your help, we'd love to get Parsoid working with zh-wikis. That construct works in the PHP converter.
We really need to know how this is *supposed* to work, and we need English documentation for it if our team is going to work on it. I think the current offerings are all in other languages.
A few notes from IRC:

[09:59] <gwicke> marktraceur: I browsed the LanguageConverter source a bit
[09:59] <gwicke> there is an autoConvert method that simply converts all text based on a dictionary lookup
[10:00] <gwicke> it only excludes markup and script/code blocks
[10:00] <gwicke> the default search language for Chinese seems to be zh-hans (simplified)
[10:01] <gwicke> am not sure when the special conversion syntax is used in practice
[10:03] --> tewwy has joined this channel (~tychay@wikimedia/Tychay).
[10:03] <gwicke> conversion is restricted to those blocks when using convert() and convertTo()
[10:03] <gwicke> plus special conversion for link targets and headings
[10:04] <gwicke> the conversion itself is performed using autoConvert (the dictionary-based method)
[10:06] * cscott is reading backlog
[10:07] <cscott> yeah, i mentioned getting minority-language buy-in in the meeting yesterday, thinking specifically of how hard it's been to get i18n feedback
[10:07] --> HaeB has joined this channel (~quassel@wikipedia/HochaufeinemBaum).
[10:08] <cscott> this languageconverter thing is changing simplified chinese to traditional, and vice-versa? ie, mainland-to-taiwan and back?
[10:13] <gwicke> cscott: there are four variants for Chinese I think
[10:13] <gwicke> Serbian and some other languages have variants too
[10:14] <gwicke> marktraceur: so my reading is that normally convert() is used, which only converts marked-up blocks (-{ }-)
[10:15] <gwicke> except for search, which uses autoconvert directly
[10:16] <gwicke> the conversion is also lossy, but less ambiguous when converting from traditional to simplified for example
[10:16] <gwicke> now the question is how we should represent all this in the DOM
[10:20] *** edsanders|away is now known as edsanders.
[10:20] <gwicke> on one hand it would be nice to abstract the issue, but with the conversion being lossy that does not seem to be possible without preserving the original (potentially mixed-variant) text
DOM spec being developed at http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks.
(In reply to comment #9)
> A few notes from IRC:

Let me explain more: The main entry point should be convertTo(), with convert() as a shortcut that uses the "preferred" (= automatically guessed from the request) variant. It accepts an almost-parsed HTML document (string) with -{}- markups embedded.

convertTo() is just a loader. It calls recursiveConvert* afterwards, which parses the -{}- syntax and breaks the text into pieces based on the -{}- markups. These pieces are fed into autoConvert().

autoConvert() extracts the text snippets which actually need conversion (with HTML tags, <code> blocks etc. excluded, but with "title" attribs in HTML tags included again...), then sends these snippets to translate().

translate() finally transforms the text using a strtr()-like mechanism.
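To make the call chain above concrete, here is a toy JavaScript sketch. The function names mirror the PHP LanguageConverter methods, but the bodies are simplified assumptions, NOT the actual MediaWiki code: the real rule syntax, exclusions, and fallback chain are far richer.

```javascript
// translate(): strtr()-like replacement, longest dictionary key first.
function translate(text, dict) {
  const keys = Object.keys(dict).sort((a, b) => b.length - a.length);
  let out = '';
  let i = 0;
  scan:
  while (i < text.length) {
    for (const k of keys) {
      if (text.startsWith(k, i)) {
        out += dict[k];
        i += k.length;
        continue scan;
      }
    }
    out += text[i++];
  }
  return out;
}

// autoConvert(): run translate() on text nodes only, skipping HTML tags.
function autoConvert(html, dict) {
  return html.replace(/([^<]*)(<[^>]*>)?/g,
    (m, text, tag) => translate(text, dict) + (tag || ''));
}

// convertTo(): split on -{ ... }- blocks; inside a block, pick the rule
// for the requested variant; outside, fall back to autoConvert().
function convertTo(html, variant, dict) {
  return html.replace(/-\{([^]*?)\}-|([^]+?(?=-\{|$))/g, (m, rules, plain) => {
    if (plain !== undefined) return autoConvert(plain, dict);
    for (const rule of rules.split(';')) {
      const idx = rule.indexOf(':');
      if (idx >= 0 && rule.slice(0, idx).trim() === variant) {
        return rule.slice(idx + 1);
      }
    }
    return rules; // no rule for this variant: keep the raw block content
  });
}

const dict = { '软件': '軟體', '计算机': '電腦' };
console.log(convertTo('计算机-{zh-tw:軟體;zh-cn:软件}-', 'zh-tw', dict));
// → 電腦軟體
```

The dictionary here stands in for the per-variant conversion tables; the real converter also handles markup inside rules, flags like H and A, and a variant fallback chain, none of which this sketch attempts.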
Some more info: http://www.mediawiki.org/wiki/Parsoid/Language_conversion
(In reply to comment #12) > Some more info: > http://www.mediawiki.org/wiki/Parsoid/Language_conversion Maybe you want to avoid pasting IPs in those join-messages onto the wiki next time. :)
The channel is public anyway, but pasting them on the wiki certainly makes it easier to search for names. It might be a good idea for you to get an IRC hostmask cloak, so that the IP does not show up in IRC logs.
(In reply to comment #14)
> The channel is public anyway, but pasting them on the wiki certainly makes it
> easier to search for names. It might be a good idea for you to get an IRC
> hostmask cloak, so that the IP does not show up in IRC logs.

I already have one, but I often see this happening:

[09:23] --> spectie has joined this channel (~fran@***).
[09:23] <-- spectie has left this server (Changing host).
[09:23] --> spectie has joined this channel (~fran@unaffiliated/spectie).

I guess it happens when the user does /msg nickserv identify xxx after joining the channel, and the sequence is usually decided by their IRC client.
About global state of dictionaries: the table affected by -{H| }- is used for link & categorylink resolution too. We may want to keep this behavior.
[Parsoid component reorg by merging JS/General and General. See bug 50685 for more information. Filter bugmail on this comment. parsoidreorg20130704]
(In reply to comment #16)
> About global state of dictionaries: the table affected by -{H| }- is used for
> link & categorylink resolution too. We may want to keep this behavior.

One more thing about -{H| }-: the current behavior is that it only affects text after it, and this behavior is sometimes deliberately used. We may want to keep it.
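A tiny sketch of that order dependence, using a made-up segment representation (this is not how the PHP converter stores rules, and the words and mapping are invented for illustration): a hidden -{H|...}- rule mutates the page's conversion table, so only text that comes *after* it in document order sees the new mapping.

```javascript
// Each segment is either page text or a hidden rule that extends the
// mutable page dictionary for one variant.
function convertWithHiddenRules(segments, variant) {
  const table = {}; // mutable page dictionary, built up as we scan
  const out = [];
  for (const seg of segments) {
    if (seg.type === 'H') {
      // e.g. { type: 'H', rules: { 'zh-tw': { '阿伯': '歐吉桑' } } }
      Object.assign(table, seg.rules[variant] || {});
    } else {
      let text = seg.text;
      for (const [from, to] of Object.entries(table)) {
        text = text.split(from).join(to);
      }
      out.push(text);
    }
  }
  return out.join('');
}

const segments = [
  { type: 'text', text: '阿伯' },                          // before the rule: unchanged
  { type: 'H', rules: { 'zh-tw': { '阿伯': '歐吉桑' } } },  // hidden rule
  { type: 'text', text: '阿伯' },                          // after the rule: converted
];
console.log(convertWithHiddenRules(segments, 'zh-tw')); // → 阿伯歐吉桑
```

This is exactly the kind of mid-page mutable state that comment 19 argues is hard to support: the result depends on document order, so expansions cannot be converted independently.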
(In reply to comment #18)
> One thing more about -{H| }-: the current behavior is that it only affects
> text after it and this behavior is sometimes deliberately used. We may want
> to keep it.

For us, mutable global state is very hard to support in any sane way. Having page-global dictionary definitions or self-contained manual conversions is fine, but changing global state in the middle of the page (even from a dynamically changing template) conflicts with a lot of optimizations and is hard to represent in a UI.
*** Bug 51325 has been marked as a duplicate of this bug. ***
Changed the title back to "Support language variant conversion in Parsoid" as this is not just about the syntax.
(In reply to comment #21)
> Changed the title back to "Support language variant conversion in Parsoid" as
> this is not just about the syntax.

There are too many things here, far more than what I mentioned in comment 0, and I may add some separate bugs from time to time... or do you want to use this one as a "meta" bug?
Yes, this is the meta bug that depends on several other bugs (see the "Depends on" field). Once we have a good overview of the issues we should probably get together to discuss possible solutions. Will you be at Wikimania?
See also bug 52661 -- the language converter should be integrated better with the preprocessor, in both PHP and Parsoid.
More info: https://www.mediawiki.org/wiki/Writing_systems/Syntax
(In reply to comment #24) > See also bug 52661 -- the language converter should be integrated better with > the preprocessor, in both PHP and Parsoid. The language converter is actually a post-processor rather than a preprocessor. Why should that change?
(In reply to comment #26)
> The language converter is actually a post-processor rather than a
> preprocessor. Why should that change?

Their point is to have the preprocessor understand those markups, to avoid interpreting them as something else.
@Gwicke wrt comment 26 -- because it would fix the bugs documented in bug 52661. (In particular, <gallery> is in sad shape right now.)
See also http://www.mediawiki.org/wiki/Requests_for_comment/Scoped_language_converter

Gwicke has an alternate proposal, which I'm sure he'll link here at some point.

As I understand it, we will parse the language converter markup, and then we will have a post-processing step which will actually apply the rules and markup to convert the text into the desired variant. As discussed (to some extent) in bug 15161, ideally visual editor would present the text in the user's preferred variant, and then we would leverage the selser mechanism to ensure that a change in variant applies only to the edited portion of the text. Again, ideally DOM blocks would be annotated with the variant they were written in.
A rough outline of my proposal as developed during Wikimania with Liangent is at http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks. Scott would like to add additional syntax, while I am proposing a two-phase approach that 1) aims at supporting visual editing of existing content and 2) builds the infrastructure for clean language variant conversion based on page-global and category-global rules and then migrates dynamic rule table modifications out of articles and templates.
(In reply to comment #30)
> A rough outline of my proposal as developed during Wikimania with Liangent is at
> http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks.
> Scott would like to add additional syntax, while I am proposing a two-phase
> approach that 1) aims at supporting visual editing of existing content and 2)
> builds the infrastructure for clean language variant conversion based on
> page-global and category-global rules and then migrates dynamic rule table
> modifications out of articles and templates.

However, this plan is only doable after Parsoid becomes the default parser, and the whole migration process must be completed at exactly the same time as Parsoid becoming so, to keep everything working. That is because (1) the PHP parser doesn't understand your schema and (2) Parsoid doesn't understand the PHP parser's -{A| }- markups.
(In reply to comment #31)
> However this plan is only doable after Parsoid become the default parser, and
> all migration process must be done at exactly the same time as Parsoid becoming
> so... to keep everything working. Because (1) PHP parser doesn't understand
> your schema (2) Parsoid doesn't understand PHP parser's -{A| }- markups.

No, this does not depend on Parsoid becoming the default parser. It does depend on efficient access to global conversion rules at parse time, which is true for both approaches.

The main difference is that I favor direct (and mostly automatic) migration of rules to versioned page metadata for efficient access and gadget / UI-based editing. The processing model is also designed to be efficient with independent transclusion expansions as done in Parsoid.

Scott prefers to store rules in page and template content instead, and lets rules leak out of templates.
(In reply to comment #32)
> The main difference is that I favor direct (and mostly automatic) migration of
> rules to versioned page metadata for efficient access and gadget / UI-based
> editing. The processing model is also designed to be efficient with independent
> transclusion expansions as done in Parsoid.
>
> Scott prefers to store rules in page and template content instead, and lets
> rules leak out of templates.

Scott's '"Category" Proposal' doesn't seem to leak, does it?
(In reply to comment #33) > Scott's '"Category" Proposal' seems not leaking? The category variant in Scott's RFC is close to what I have been advocating for a while. He does not rule out leaking of rules out of templates, but mentions the problems associated with doing so. So it might or might not be leaking. See https://www.mediawiki.org/wiki/Requests_for_comment/Page_and_category_based_language_variant_conversion for a more detailed write-up of my proposal.
Can you improve your RFC to specify more precisely the scoping you anticipate for 'global' rules? In particular, it seems that a global rule defined in a page *does* affect the content of templates included in the page (a sort of leak). What happens when a template defines a global rule? Does it get added to the inherited global rules from the parent page, and then apply to any subtemplates? (FWIW, my Category proposal does state that page-scope rules do not leak -- neither into templates nor up to the enclosing context. I'm not 100% sure that's desirable, but that's how it currently reads.)
I think we should be *extremely* restrictive about where language rules can leak. This is because they lead to several problems:

(1) Rule changes make it hard to give a faithful real-time view of *any* plaintext.
(2) Rule changes can cause unexpected errors in distant text.
(3) Few people can proofread both zh-Hans and zh-Hant. Therefore, almost anyone who makes an edit will be unable to proofread at least one of the variants it might affect.

On the other hand, leaking currently allows pages to import rules. I think we should preserve this facility but make it more separate.

1. In general, there should be no leakage: rules should be page-global, and should not leak into or out of templates. This means template *arguments* should be subject to the rules of the page in which they are written, but text generated by a template should not.

2. As an exception to the "no leakage" rule, there should be a new type of template called a Glossary, whose only purpose is to leak rules into the calling page. That way, language rules are completely separate and independent of any other template behaviour. These Glossaries should be referenced at the top of the page only.

3. The page which defines a template is free to use rules and Glossaries too. But they will only affect the text generated by the template itself -- they won't leak into any text defined in the calling page. This includes the arguments passed into the template, because they're written in the calling page.

As you can see, this is just cscott's "Global" proposal, but with the additional restriction that the templates that leak rules cannot have any other functionality.
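A toy model of the scoping in points 1-3 above. The data shapes here are invented for illustration and are not a real MediaWiki API: template arguments are converted under the calling page's rules (because they are written there), while template-owned text is converted only under the template's own rules, so nothing leaks in either direction.

```javascript
// Apply a flat {from: to} rule table to a string.
function applyRules(text, rules) {
  let out = text;
  for (const [from, to] of Object.entries(rules)) {
    out = out.split(from).join(to);
  }
  return out;
}

// Expand a template while keeping the two rule scopes separate.
function expandTemplate(template, args, callerRules) {
  // Arguments are written in the calling page, so they get the caller's rules.
  const converted = args.map(a => applyRules(a, callerRules));
  // Template-owned text only sees the template's own rules.
  return template.parts
    .map(p => p.arg !== undefined ? converted[p.arg]
                                  : applyRules(p.text, template.rules))
    .join('');
}

const callerRules = { '博客': '部落格' };  // rule defined on the calling page
const tmpl = {
  rules: { '软件': '軟體' },               // rule defined with the template
  parts: [{ text: '软件' }, { arg: 0 }],   // literal label, then first argument
};
console.log(expandTemplate(tmpl, ['博客'], callerRules)); // → 軟體部落格
```

Note that the template's '软件' rule never touches the argument text, and the caller's '博客' rule never touches the template's label: each scope is converted exactly once, under its own rules.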
David, Roan, Scott, Subbu and I met in the office to discuss this. Short summary of the plans for the next steps:

1) Find nesting issues and see if we can fix them up with a bot. Also investigate use cases for markup in variant conversion rules.

2) Parse all -{ }- syntax and represent it in the DOM. Exact spec TBD in https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks. Render the default variant according to the fallback chain for output-producing rules.

3) Enable editing of inline (once-only) rules in the VE. Most rule table modifications seem to be templated and will not be applied, so are not directly relevant. Rules that only modify the table but produce no output directly in page content can be represented as mw:Placeholder and will simply be preserved. This will make the VE usable for typical editors on variant-enabled wikis without requiring the variant conversion overhaul to be done first.

For the longer-term strategy, we (mostly) agreed on:

1) Add the capability to associate an ordered list of glossaries with a page. These can either be stored in a separate namespace, or in something like Special:Glossary. They should be revision-controlled and machine-readable for processing and UI purposes (JSON).

2) Add the capability to add page-specific rules that override glossary rules. Only glossaries and global rules associated with the top-level page itself are considered. This makes the set of conversion rules independent of dynamic template expansions.

3) Apply the combined rule set to the entire page, including templated content.
Rationales:
* Simple mental model
* Efficient to implement
* Consistent conversion of passed-in content, even if it is massaged further during transclusion expansion
* Content in templates (labels, but also real content in some infoboxes) can still be protected or converted differently with local inline rules, as is done right now

The details of how this can be implemented depend on whether we reach our goal of implementing multi-part revision storage that we can use for metadata by the next quarter.

PS @David: Conversion rules should be passed into a pure function that converts each template expansion. Nothing at all should leak -- otherwise our function would no longer be pure, and we could no longer efficiently update template expansions independently.
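The "pure function" from the PS above can be sketched as follows: every template expansion is converted with an explicit, immutable rule set built from the page's glossaries plus page-local overrides. The data shapes are assumptions for illustration, not the actual Parsoid interfaces.

```javascript
// Convert one expansion with an explicit rule set; no hidden state.
function convertExpansion(text, variant, glossaries, pageRules) {
  // Later sources win: glossaries in order, then page-specific overrides.
  const rules = Object.assign(
    {},
    ...glossaries.map(g => g[variant] || {}),
    pageRules[variant] || {}
  );
  let out = text;
  for (const [from, to] of Object.entries(rules)) {
    out = out.split(from).join(to);
  }
  // Inputs are never mutated, so any expansion can be redone independently.
  return out;
}

const glossaries = [{ 'zh-tw': { '软件': '軟體' } }]; // shared glossary for the page
const pageRules = { 'zh-tw': { '博客': '部落格' } };  // page-local override rules
console.log(convertExpansion('软件博客', 'zh-tw', glossaries, pageRules));
// → 軟體部落格
```

Because the function depends only on its arguments, the same rule set can be applied to each transclusion expansion in isolation, which is what makes independent re-expansion and caching feasible.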
(In reply to comment #37) > David, Roan, Scott, Subbu and me met in the office to discuss this. Short > summary of the plans for the next steps: > > 1) Find nesting issues and see if we can fix them up with a bot. Also > investigate use cases for markup in variant conversion rules. Why do we want to get rid of nested -{}- markups? It's useful in some cases. See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]] (proposed replacement for [[Template:地区用词]]). Try to expand a [[Template:地区用词3]] call and see its result: {{地区用词3|zh-cn=cn|zh-tw=tw}}
(In reply to comment #38)
> Why do we want to get rid of nested -{}- markups? It's useful in some cases.
>
> See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]]
> (proposed replacement for [[Template:地区用词]]).
>
> Try to expand a [[Template:地区用词3]] call and see its result:
>
> {{地区用词3|zh-cn=cn|zh-tw=tw}}

Liangent: Gwicke did not fully explain the nesting issue we were talking about. What we had in mind was use in attributes, e.g.:

-{zh-cn:<span style='color:red';zh-tw:<span style='color:green'}-foo</span>

We are proposing to use a bot to fix this to:

<span style='-{zh-cn:color:red;zh-tw:color:green}-'>foo</span>

The rewritten form has the property that all HTML snippets have a well-formed DOM representation, whereas the original does not.
(In reply to comment #39)
> Liangent: Gwicke did not fully explain the nesting issue we were talking
> about. What we had in mind was use in attributes. We are proposing using a
> bot to fix this. The rewritten form has the property that all HTML snippets
> have a well-formed DOM representation whereas the original does not.

Oh, that's what we've discussed before -- and that's fine.
Change 50767 abandoned by GWicke: Revert "(bug 41716) Add variant config to siprop=general" Reason: This ship has sadly sailed. Too late to clean it up I guess. Sigh. https://gerrit.wikimedia.org/r/50767
@Liangent: can you describe the use cases for "the other kind" of nested markup? That is, -{ }- inside -{ }-?

Our proposed DOM tree (https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Language_conversion_blocks) can handle:

-{ foo -{ bar }- bat }-

but not:

foo-{zh-cn:blog -{ nested }-; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

etc. How are nested -{ }- markups of this sort actually used?
They're mostly used in templates. See also [[zh:Template:DISPLAYTITLE]] and [[zh:Module:Template:地区用词]].
Change 140235 had a related patch set uploaded by Cscott: WIP: parse language converter markup. https://gerrit.wikimedia.org/r/140235
I've written up some notes about nested conversion blocks and other discoveries at https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Language_conversion_blocks#Notes
(In reply to Liangent from comment #0) > Phase 1: Capsule conversion syntax (-{}- markups) into non-editable blocks > to avoid breakage. (In reply to Gabriel Wicke from comment #2) > It would be good to look into implementing phase 1 (recognize and protect > language conversion content). I understand this has some value in itself for PDF export, see bug 34919 comment 17.
(In reply to Nemo from comment #46)
> I understand this has some value in itself for PDF export, see bug 34919
> comment 17.

And more were filed, like bug 71815. Should they all depend on this? Does this really depend on bug 43547? Maybe this should just be converted to a tracking bug so that we're free to add dependencies without hairsplitting.