Last modified: 2014-05-01 08:35:51 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T48455, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 46455 - Verify that only valid languages are accepted
Status: RESOLVED DUPLICATE of bug 37459
Product: MediaWiki extensions
Classification: Unclassified
Component: WikidataRepo
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Wikidata bugs
Keywords: i18n
Duplicates: 44379
Depends on:
Blocks:
Reported: 2013-03-22 15:19 UTC by jeblad
Modified: 2014-05-01 08:35 UTC (History)

Description jeblad 2013-03-22 15:19:52 UTC
In the API modules for Wikibase there is an entry for valid languages, and it is defined like this:

languages           - By default the internationalized values are returned in all available languages. This parameter allows filtering these down to one or more languages by providing one or more language codes.

Values (separate with '|'): aa, ab, ace, af, ak, aln, als, am, an, ang, anp, ar, arc, arn, ary, arz, as, ast, av, avk, ay, az,
azb, ba, bar, bat-smg, bcc, bcl, be, be-tarask, be-x-old, bg, bh, bho, bi, bjn, bm, bn, bo, bpy,
bqi, br, brh, bs, bug, bxr, ca, cbk-zam, cdo, ce, ceb, ch, cho, chr, chy, ckb, co, cps, cr, crh,
crh-latn, crh-cyrl, cs, csb, cu, cv, cy, da, de, de-at, de-ch, de-formal, diq, dsb, dtp, dv, dz, ee,
egl, el, eml, en, en-ca, en-gb, eo, es, et, eu, ext, fa, ff, fi, fit, fiu-vro, fj, fo, fr, frc, frp,
frr, fur, fy, ga, gag, gan, gan-hans, gan-hant, gd, gl, glk, gn, got, grc, gsw, gu, gv, ha, hak,
haw, he, hi, hif, hif-latn, hil, ho, hr, hsb, ht, hu, hy, hz, ia, id, ie, ig, ii, ik, ike-cans,
ike-latn, ilo, inh, io, is, it, iu, ja, jam, jbo, jut, jv, ka, kaa, kab, kbd, kbd-cyrl, kg, khw, ki,
kiu, kj, kk, kk-arab, kk-cyrl, kk-latn, kk-cn, kk-kz, kk-tr, kl, km, kn, ko, ko-kp, koi, kr, krc,
kri, krj, ks, ks-arab, ks-deva, ksh, ku, ku-latn, ku-arab, kv, kw, ky, la, lad, lb, lbe, lez, lfn,
lg, li, lij, liv, lmo, ln, lo, loz, lt, ltg, lus, lv, lzh, lzz, mai, map-bms, mdf, mg, mh, mhr, mi,
min, mk, ml, mn, mo, mr, mrj, ms, mt, mus, mwl, my, myv, mzn, na, nah, nan, nap, nb, nds, nds-nl,
ne, new, ng, niu, nl, nl-informal, nn, no, nov, nrm, nso, nv, ny, oc, om, or, os, pa, pag, pam, pap,
pcd, pdc, pdt, pfl, pi, pih, pl, pms, pnb, pnt, prg, ps, pt, pt-br, qu, qug, rgn, rif, rm, rmy, rn,
ro, roa-rup, roa-tara, ru, rue, rup, ruq, ruq-cyrl, ruq-latn, rw, sa, sah, sat, sc, scn, sco, sd,
sdc, se, sei, sg, sgs, sh, shi, shi-tfng, shi-latn, si, simple, sk, sl, sli, sm, sma, sn, so, sq,
sr, sr-ec, sr-el, srn, ss, st, stq, su, sv, sw, szl, ta, tcy, te, tet, tg, tg-cyrl, tg-latn, th, ti,
tk, tl, tly, tn, to, tokipona, tpi, tr, tru, ts, tt, tt-cyrl, tt-latn, tum, tw, ty, tyv, udm, ug,
ug-arab, ug-latn, uk, ur, uz, ve, vec, vep, vi, vls, vmf, vo, vot, vro, wa, war, wo, wuu, xal, xh,
xmf, yi, yo, yue, za, zea, zh, zh-classical, zh-cn, zh-hans, zh-hant, zh-hk, zh-min-nan, zh-mo,
zh-my, zh-sg, zh-tw, zh-yue, zu

In this list there are entries that should not be used for language-specific entries, like the Norwegian (no) entry. This is a metalanguage for Bokmål (nb) and Nynorsk (nn). I guess there are several others that are also wrong. Some of them could be redirected, but some should not be used at all. If the entry is used as a site prefix, or in the site id, we should probably set up a redirect even if it is not strictly correct.
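
A minimal sketch of the kind of check being asked for, written in Python purely for illustration (Wikibase itself is PHP, and none of these names exist there). It assumes a small excerpt of the accepted list and a hypothetical deny-list holding the metalanguage code "no":

```python
# Hypothetical sketch; these names are illustrative and not part of Wikibase.

# Small excerpt of the accepted values listed above (assumption: a real
# check would use the full list).
ACCEPTED_CODES = {"de", "en", "nb", "nn", "no"}

# Codes that are accepted by the API but should not carry content.
# "no" is the metalanguage for Bokmål (nb) and Nynorsk (nn).
NOT_FOR_CONTENT = {"no"}


def split_languages_param(raw):
    """Split a '|'-separated parameter value into codes, dropping empties."""
    return [code.strip() for code in raw.split("|") if code.strip()]


def check_languages(raw):
    """Return (usable, rejected) lists for a raw parameter value."""
    usable, rejected = [], []
    for code in split_languages_param(raw):
        if code in ACCEPTED_CODES and code not in NOT_FOR_CONTENT:
            usable.append(code)
        else:
            rejected.append(code)
    return usable, rejected
```

With these assumptions, a value such as "en|no|nb|xx" would keep en and nb and reject both the metalanguage code and the unknown code.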
Comment 1 jeblad 2013-03-22 15:23:52 UTC
Note that this is mainly about removing invalid entries from our own use of the languages, not about a more general solution. It's kind of a stopgap solution to avoid data being uploaded for non-existent languages.
Comment 2 jeblad 2013-03-22 15:28:59 UTC
Note that this comes from Utils::getLanguageCodes, which uses \Language::fetchLanguageNames(). That only creates a list of valid names, but says nothing about their function.
Comment 3 Jon Harald Søby 2013-03-22 16:15:00 UTC
Off the top of my head, these should not be used at all:
* simple (labels and descriptions on Wikidata should all be pretty simple by default. This is more of a community/policy issue and should be discussed by the community on Wikidata, but technically I don't see the need for separate simple and en descriptions)
* tokipona (why is this still available at all?)
* ug (Wikipedia uses both Arabic and Latin scripts, labels/descriptions should specify which one is used)



And these should be redirects:
* als -> gsw (correct language code)
* bat-smg -> sgs (correct language code)
* be-x-old -> be-tarask
* fiu-vro -> vro (correct language code)
* no -> nb (for legacy Wikipedia reasons)
* roa-rup -> rup (correct language code)
* zh-classical -> lzh (correct language code)
* zh-min-nan -> nan (correct language code)
* zh-yue -> yue (correct language code)
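
The redirects above could be held in a simple lookup table. This is only a sketch in Python (not Wikibase code); the names CODE_REDIRECTS and canonical_code are made up for the example, and the mapping contains exactly the pairs listed above:

```python
# Hypothetical sketch; the table mirrors the redirect list above and the
# names are illustrative, not an existing MediaWiki or Wikibase API.
CODE_REDIRECTS = {
    "als": "gsw",
    "bat-smg": "sgs",
    "be-x-old": "be-tarask",
    "fiu-vro": "vro",
    "no": "nb",
    "roa-rup": "rup",
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
}


def canonical_code(code):
    """Return the redirect target for a legacy code, or the code unchanged."""
    return CODE_REDIRECTS.get(code, code)
```

Because no redirect target is itself a legacy code, a single lookup is enough and there is no need to follow chains.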


Borderline cases, not sure which way these should redirect:
* hif-latn <-> hif (Wikipedia seems to only use Latin script)
* kbd-cyrl <-> kbd (Wikipedia seems to only use Cyrillic script)
* ku-latn <-> ku (Wikipedia seems to only use Latin script)
* tt-cyrl <-> tt (Wikipedia seems to only use Cyrillic script)


Not sure:
* crh / crh-latn / crh-cyrl: probably crh should redirect to crh-latn, or the other way around. crhwiki seems to use Latin only.
* gan / gan-hans / gan-hant: some sort of automatic conversion should be made available on Wikidata, as with other languages written in Han script(s)
* kk / kk-arab / kk-cyrl / kk-latn / kk-cn / kk-kz / kk-tr: the Wikipedia has automatic conversion (from Latin?), should be made available on Wikidata.
* ks / ks-arab / ks-deva: ks should probably redirect to ks-arab, or the other way around. kswiki seems to use Arabic only. Probably no automatic conversion available.
* ruq / ruq-cyrl / ruq-latn: maybe ruq should be disabled and only ruq-cyrl and ruq-latn accepted as inputs?
* shi / shi-tfng / shi-latn: don't know which one is more common
* sr / sr-ec / sr-el: same as above
* tg / tg-cyrl / tg-latn: automatic conversion?
* zh and variants (except those mentioned earlier): automatic conversion exists on Wikipedia, should be reusable on Wikidata

In most of these "Not sure" cases, the main language code should probably be disabled, and input should be specifically in either of the two/more variants, though the presence of automatic conversion for some may make things a bit more complicated.
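
As a rough illustration of disabling a main code in favour of its variants, here is a Python sketch (not Wikibase code). Which codes end up in the table is exactly the open question above, so the three entries are assumptions picked only to show the mechanism, and the names are hypothetical:

```python
# Hypothetical sketch; which main codes get disabled is undecided, so the
# entries below are assumptions chosen only to demonstrate the check.
REQUIRED_VARIANTS = {
    "ruq": {"ruq-cyrl", "ruq-latn"},
    "shi": {"shi-tfng", "shi-latn"},
    "tg": {"tg-cyrl", "tg-latn"},
}


def check_input_code(code):
    """Return (ok, message); reject a bare main code that has variants."""
    variants = REQUIRED_VARIANTS.get(code)
    if variants:
        return False, "use one of: " + ", ".join(sorted(variants))
    return True, ""
```

Codes with automatic conversion (zh, kk, gan) are deliberately left out of the table, since conversion would change how their input should be handled.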
Comment 4 Nemo 2013-04-05 08:36:25 UTC
*** Bug 44379 has been marked as a duplicate of this bug. ***
Comment 5 Nikola Smolenski 2013-04-05 08:54:29 UTC
I don't see why various scripts would redirect to each other; this should rather be done using language fallback (bug 36430).
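
To make the difference from a redirect concrete, here is a Python sketch of a fallback lookup (not the design from bug 36430, whose details are not given here). The chains and the function name are assumptions for illustration only:

```python
# Hypothetical sketch; the fallback chains are invented for illustration
# and do not reflect actual MediaWiki or Wikibase configuration.
FALLBACK_CHAINS = {
    "sr-ec": ["sr"],
    "sr-el": ["sr"],
}


def pick_label(labels, requested):
    """Return (code, label) for the first code in the chain that has a label.

    `labels` maps language codes to label strings. Returns None when neither
    the requested code nor any of its fallbacks has a label.
    """
    for code in [requested] + FALLBACK_CHAINS.get(requested, []):
        if code in labels:
            return code, labels[code]
    return None
```

Unlike a redirect, the stored data keeps its original code; fallback only affects which existing label is shown when the requested one is missing.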
Comment 6 Nemo 2013-04-07 00:30:20 UTC

*** This bug has been marked as a duplicate of bug 37459 ***
