Last modified: 2014-11-13 19:28:51 UTC
The set of characters valid in titles is configurable in MediaWiki via the $wgLegalTitleChars global. We should expose it through the API (presumably in the siteinfo general section) and use it in our tokenizer to recognize title characters.
See https://gerrit.wikimedia.org/r/60852, which fixed our link parser to recognize the default $wgLegalTitleChars. The PHP parser uses this regexp to match links, in Parser.php around line 1780:

    $tc = Title::legalChars() . '#%';
    # Match a link having the form [[namespace:link|alternate]]trail
    $e1 = "/^([{$tc}]+)(?:\\|(.+?))?]](.*)\$/sD";

where Title::legalChars() just returns $wgLegalTitleChars.
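For illustration, here is a rough Python sketch of how a client could build an equivalent link regexp once the legal title chars are available to it. It is not the Parsoid or core implementation. The names DEFAULT_LEGAL_TITLE_CHARS and build_link_regex are hypothetical, the default value shown is assumed to match MediaWiki's stock $wgLegalTitleChars, and widening the byte range \x80-\xFF to all non-ASCII code points is an assumption about how a Unicode-string regex engine should interpret it.

```python
import re

# Assumed stock value of $wgLegalTitleChars, written as the body of a
# regex character class (hypothetical constant, not fetched from a wiki).
DEFAULT_LEGAL_TITLE_CHARS = r""" %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+"""


def build_link_regex(legal_title_chars):
    """Build a regex for the text following '[[', modelled on the PHP $e1."""
    # PHP matches on bytes, so \x80-\xFF covers every non-ASCII UTF-8 byte.
    # Python matches on code points, so widen the range (an assumption).
    tc = legal_title_chars.replace(r"\x80-\xFF", r"\u0080-\U0010FFFF")
    # Like the PHP code, also allow '#' (fragment) and '%' (URL escapes).
    tc += "#%"
    # \Z stands in for PCRE's '$' with the D (dollar-end-only) modifier,
    # and re.DOTALL stands in for the s modifier.
    return re.compile(r"^([" + tc + r"]+)(?:\|(.+?))?]](.*)\Z", re.DOTALL)


link_re = build_link_regex(DEFAULT_LEGAL_TITLE_CHARS)

m = link_re.match("Main Page|home]]s and more")
if m:
    target, label, trail = m.groups()
    print(target, label, trail)
```

With a wiki that customizes $wgLegalTitleChars, the same function would be called with the value reported by the API instead of the assumed default.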
Change 173051 had a related patch set uploaded by Arlolra: (Bug 47651) Expose legalTitleChars through the API https://gerrit.wikimedia.org/r/173051
Change 173051 had a related patch set uploaded by Arlolra: Expose legaltitlechars through the API https://gerrit.wikimedia.org/r/173051
Change 173051 merged by jenkins-bot: Expose legaltitlechars through the API https://gerrit.wikimedia.org/r/173051