Last modified: 2014-10-21 12:50:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T70490, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 68490 - Consider using content language for "html lang", rather than interface language
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Internationalization
Version: unspecified
Hardware: All All
Importance: Lowest enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-07-24 00:16 UTC by Nemo
Modified: 2014-10-21 12:50 UTC (History)

Description Nemo 2014-07-24 00:16:36 UTC
0) Open Chromium and, in one tab, chrome://translate-internals/#detection-logs
1) Visit https://fi.wikipedia.org/wiki/Opiskelijaraha?uselang=it and https://fi.wikipedia.org/wiki/Wikipedia:Etusivu?uselang=it (or set the interface language to "it" and drop the uselang parameter).
2) Save their HTML, change '<html lang="it"' to '<html lang="fi"' in the saved copies, and open the saved HTML in the same browser.
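
For anyone repeating the test, the lang edit on the saved copies can be scripted. This is only a minimal sketch: the function name set_root_lang and the sample string are made up for illustration, and it assumes the root tag already carries a double-quoted lang attribute, as in the markup quoted in this report.

```python
import re

def set_root_lang(html: str, new_lang: str) -> str:
    """Replace the lang value on the opening <html> tag only (hypothetical helper)."""
    def fix_tag(match: re.Match) -> str:
        # Require whitespace before "lang" so that xml:lang is left alone.
        return re.sub(
            r'(\slang=")[^"]*(")',
            lambda m: m.group(1) + new_lang + m.group(2),
            match.group(0),
            count=1,
        )
    # Only the first <html ...> tag is touched; inner lang attributes stay as they are.
    return re.sub(r"<html\b[^>]*>", fix_tag, html, count=1, flags=re.IGNORECASE)

# Shortened stand-in for a saved page.
saved = '<html lang="it" dir="ltr" class="client-nojs"><body><div lang="it">x</div></body></html>'
patched = set_root_lang(saved, "fi")
print(patched)
```

Inner elements such as the interface divs keep their own lang="it", so only the root language differs between the live page and the saved copy.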

I. Expected: the article content is recognised as "fi" locale and appropriately translated (or not) by Chromium according to my preferences.
II. Observed: the result is rather random depending on the amount of text in content vs. interface, but in general the interface language prevails because it's in the general html lang attribute. In the log something like this can be seen:

[
    {
        "adopted_language": "it",
        "cld_language": "it",
        "content_language": "fi",
        "html_root_language": "it",
        "is_cld_reliable": true,
        "time": 1406157710650.132,
        "url": "https://fi.wikipedia.org/wiki/Opiskelijaraha?uselang=it"
    },
    {
        "adopted_language": "und",
        "cld_language": "it",
        "content_language": "",
        "html_root_language": "fi",
        "is_cld_reliable": true,
        "time": 1406158649658.846,
        "url": "http://koti.kapsi.fi/~federico/tmp/Opiskelijaraha-it.html"
    },
    {
        "adopted_language": "und",
        "cld_language": "fi",
        "content_language": "fi",
        "html_root_language": "it",
        "is_cld_reliable": true,
        "time": 1406159280979.992,
        "url": "https://fi.wikipedia.org/wiki/Wikipedia:Etusivu?uselang=it"
    },
    {
        "adopted_language": "und",
        "cld_language": "en",
        "content_language": "",
        "html_root_language": "fi",
        "is_cld_reliable": true,
        "time": 1406159369076.235,
        "url": "http://koti.kapsi.fi/~federico/tmp/Etusivu-it.html"
    }
]
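
To make the pattern in these entries easier to see, here is a small sketch that compares the root language with the CLD guess for each of the four entries above (trimmed to the fields used). The helper name summarize is made up; this only tabulates the log and makes no claim about how Chromium actually decides.

```python
def summarize(entries):
    """For each log entry, note whether html_root_language agrees with cld_language."""
    return [
        {
            "url": e["url"],
            "root_matches_cld": e["html_root_language"] == e["cld_language"],
            "adopted": e["adopted_language"],
        }
        for e in entries
    ]

# The four entries from the detection log, reduced to the compared fields.
log = [
    {"adopted_language": "it", "cld_language": "it", "html_root_language": "it",
     "url": "https://fi.wikipedia.org/wiki/Opiskelijaraha?uselang=it"},
    {"adopted_language": "und", "cld_language": "it", "html_root_language": "fi",
     "url": "http://koti.kapsi.fi/~federico/tmp/Opiskelijaraha-it.html"},
    {"adopted_language": "und", "cld_language": "fi", "html_root_language": "it",
     "url": "https://fi.wikipedia.org/wiki/Wikipedia:Etusivu?uselang=it"},
    {"adopted_language": "und", "cld_language": "en", "html_root_language": "fi",
     "url": "http://koti.kapsi.fi/~federico/tmp/Etusivu-it.html"},
]

rows = summarize(log)
for row in rows:
    print(row["adopted"], row["root_matches_cld"], row["url"])
```

In this sample a language is adopted only in the one entry where the root attribute and the CLD guess agree; the other three end up as "und". Four entries are not enough to say more than that.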

See also https://code.google.com/p/chromium/issues/detail?id=254330#c6 and https://code.google.com/p/chromium/issues/detail?id=95394#c20 ; per https://code.google.com/p/chromium/issues/detail?id=95394#c6 it doesn't look likely that Chromium will be able to recognise language of page fragments any time soon.

Yet, we have some optimistic "lang" tagging like:

<html lang="it" dir="ltr" class="client-nojs">
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject page-Opiskelijaraha skin-monobook action-view">
<div id="globalWrapper">
		<div id="column-content">
			<div id="content" class="mw-body-primary" role="main">
				<div id="bodyContent" class="mw-body">
					<div id="contentSub" lang="it" dir="ltr"></div>
					<!-- start content -->
					<div id="mw-content-text" lang="fi" dir="ltr" class="mw-content-ltr">
					PAGE IS HERE
					</div>
				</div>
			</div>
		</div>
		<div id="column-one" lang="it" dir="ltr">
		BASICALLY ALL INTERFACE
		</div>
</div>
</body></html>

The interface is correctly tagged as "it" and the content as "fi", but they're both wrapped in a (false) html lang which trumps them.
Comment 1 Niklas Laxström 2014-07-24 12:01:27 UTC
(In reply to Nemo from comment #0)
> The interface is correctly tagged as "it" and the content as "fi", but
> they're both wrapped in a (false) html lang which trumps them.

If that is happening somewhere, it is broken. The attribute closest to the content wins. That's how HTML works.
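
The "closest attribute wins" rule can be sketched with Python's standard html.parser. The class name LangResolver and the sample markup (a cut-down version of the snippet in comment #0 plus one made-up untagged paragraph) are assumptions for illustration, and the sketch expects well-formed nesting; it is not how any browser implements this.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never enter the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class LangResolver(HTMLParser):
    """Record each piece of text with the lang of its nearest tagged ancestor."""

    def __init__(self):
        super().__init__()
        self.lang_stack = []   # effective language of every open element
        self.found = []        # (effective language, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        inherited = self.lang_stack[-1] if self.lang_stack else None
        # Simplification: an empty lang="" is treated like a missing attribute.
        self.lang_stack.append(dict(attrs).get("lang") or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.lang_stack:
            self.lang_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            lang = self.lang_stack[-1] if self.lang_stack else None
            self.found.append((lang, text))

sample = """<html lang="it"><body>
<div id="mw-content-text" lang="fi">PAGE IS HERE</div>
<div id="column-one" lang="it">BASICALLY ALL INTERFACE</div>
<p>untagged text</p>
</body></html>"""

resolver = LangResolver()
resolver.feed(sample)
print(resolver.found)
```

The content div resolves to "fi" despite the root "it"; only the untagged paragraph falls back to the root language.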
Comment 2 Nemo 2014-07-24 12:45:24 UTC
(In reply to Niklas Laxström from comment #1)
> The attribute closest to the
> content wins. That's how HTML works.

Thanks, I found a source: http://www.w3.org/TR/html401/struct/dirlang.html#h-8.1.2

Niklas pointed out that, if we removed or changed the general html lang attribute, some inner element missing its own lang attribute would then inherit an incorrect language. Chasing them one by one is not feasible.

Before filing this bug I failed to find any element lacking its own language information, but this would need more inspection, including with other skins and extensions. Maybe one of the "broad" divs can be used to tag all such items with less broad effects.
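
Rather than chasing such elements by hand, a saved page could be scanned for text whose only language source is the root element. The following is a rough sketch under assumptions: the class name RootOnlyFinder is made up, the sample markup contains an invented heading without lang (not taken from real MediaWiki output), and well-formed nesting is assumed.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never enter the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class RootOnlyFinder(HTMLParser):
    """Collect text that no element below <html> tags with a language."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # (label, covered by a non-root lang?)
        self.orphans = []        # (label of enclosing element, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        label = f"{tag}#{attrs['id']}" if attrs.get("id") else tag
        parent_covered = self.open_elements[-1][1] if self.open_elements else False
        # The root's own lang does not count: it is the attribute we might change.
        covered = parent_covered or (tag != "html" and bool(attrs.get("lang")))
        self.open_elements.append((label, covered))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.open_elements:
            self.open_elements.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_elements and not self.open_elements[-1][1]:
            self.orphans.append((self.open_elements[-1][0], text))

sample = """<html lang="it"><body>
<div id="content"><h1 id="example-heading">Example title</h1>
<div id="mw-content-text" lang="fi">PAGE IS HERE</div></div>
<div id="column-one" lang="it">BASICALLY ALL INTERFACE</div>
</body></html>"""

finder = RootOnlyFinder()
finder.feed(sample)
print(finder.orphans)
```

Anything reported here would change language if the root attribute were switched to the content language, so the list gives a rough idea of what would need its own tag, per skin and per extension.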

P.S.: The bug was tested with 35.0.1916.153 Russian Fedora (274914); the upstream links show they're aware of the problem (which may or may not be filed clearly, though) but don't plan to do anything about it soon. I'll try to find out whether the issue and use case are clear to them.
Comment 3 Niklas Laxström 2014-10-21 12:50:04 UTC
I'm not sure whether this is something we want to try to fix or just leave to upstream to figure out. Leaving open with lowest priority for now.
