Last modified: 2013-04-09 17:24:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T38839, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 36839 - Problem with non-ASCII characters and long requests
Problem with non-ASCII characters and long requests
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Internationalization (Other open bugs)
1.20.x
All All
: Highest critical (vote)
: ---
Assigned To: Nobody - You can work on this!
: i18n
: 36799 37021 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-05-14 17:50 UTC by NicoV
Modified: 2013-04-09 17:24 UTC (History)
20 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description NicoV 2012-05-14 17:50:10 UTC
frwiki says Version = 1.20wmf2 on May 14
A few days ago, it was working correctly.

API seems to have trouble with page titles with non-ASCII characters when the size of the API request is very long.

For example, a query request for templates (prop=templates) returns incorrect results for page titles with non-ASCII characters, when the list of pages (argument "titles=") contains several dozen pages : the API answer returns incorrect page names when there are non-ASCII characters (é instead of é, ...) and incorrect information (it doesn't use the correct page title).

But if the same request is made with a smaller list of pages, the API answer becomes correct again.

Request with incorrect answer
=============================

http://fr.wikipedia.org/w/api.php?prop=templates&action=query&titles=Acteur%7CAlbert%20Robbins%7CAnglais%7CAnn%20Donahue%7CAnthony%20E.%20Zuiker%7CCarol%20Mendelsohn%7CCatherine%20Willows%7CDavid%20Hodges%7CDavid%20Phillips%7CGil%20Grissom%7CGreg%20Sanders%7CHodges%7CInternet%20Movie%20Database%7CJim%20Brass%7CLady%20Heather%7CLes%20Experts%20(s%C3%A9rie%20t%C3%A9l%C3%A9vis%C3%A9e)%7CLes%20Experts%20:%20Manhattan%7CLes%20Experts%20:%20Miami%7CListe%20des%20personnages%20des%20Experts%7CListe%20des%20%C3%A9pisodes%20des%20Experts%7CMod%C3%A8le%20discussion:Palette%20Les%20Experts%7CNick%20Stokes%7CPersonnage%20de%20fiction%7CPersonnage%20fictif%7CPersonnage%20de%20fiction%7CPersonnages%20r%C3%A9currents%20dans%20Les%20Experts%7CRaymond%20Langston%7CRiley%20Adams%7CSaison%201%20des%20Experts%7CSaison%2010%20des%20Experts%7CSaison%2011%20des%20Experts%7CSaison%2012%20des%20Experts%7CSaison%202%20des%20Experts%7CSaison%203%20des%20Experts%7CSaison%204%20des%20Experts%7CSaison%205%20des%20Experts%7CSaison%206%20des%20Experts%7CSaison%207%20des%20Experts%7CSaison%208%20des%20Experts%7CSaison%209%20des%20Experts%7CSara%20Sidle%7CSofia%20Curtis%7CS%C3%A9rie%20t%C3%A9l%C3%A9vis%C3%A9e%7CWallace%20Langham%7CWarrick%20Brown%7CWendy%20Simms%7C%C3%89tats-Unis

Answer :
<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page ns="0" title="Les Experts (série télévisée)" missing="" />
      <page ns="0" title="Liste des épisodes des Experts" missing="" />
      <page ns="0" title="Modèle discussion:Palette Les Experts" missing="" />
      <page ns="0" title="Personnages récurrents dans Les Experts" missing="" />
      <page ns="0" title="Série télévisée" missing="" />
      <page ns="0" title="Wallace Langham" missing="" />
      <page ns="0" title="États-Unis" missing="" />
      <page pageid="43579" ns="0" title="Acteur" />


Request with correct answer
===========================

http://fr.wikipedia.org/w/api.php?prop=templates&action=query&titles=Acteur%7CAlbert%20Robbins%7CAnglais%7CAnn%20Donahue%7CAnthony%20E.%20Zuiker%7CCarol%20Mendelsohn%7CCatherine%20Willows%7CDavid%20Hodges%7CDavid%20Phillips%7CGil%20Grissom%7CGreg%20Sanders%7CHodges%7CInternet%20Movie%20Database%7CJim%20Brass%7CLady%20Heather%7CLes%20Experts%20(s%C3%A9rie%20t%C3%A9l%C3%A9vis%C3%A9e)%7CLes%20Experts%20:%20Manhattan%7CLes%20Experts%20:%20Miami%7CListe%20des%20personnages%20des%20Experts%7CListe%20des%20%C3%A9pisodes%20des%20Experts%7CMod%C3%A8le%20discussion:Palette%20Les%20Experts%7CPersonnages%20r%C3%A9currents%20dans%20Les%20Experts%7CS%C3%A9rie%20t%C3%A9l%C3%A9vis%C3%A9e%7C%C3%89tats-Unis

Answer :
<api>
  <query>
    <pages>
      <page ns="0" title="Modèle discussion:Palette Les Experts" missing="" />
      <page pageid="43579" ns="0" title="Acteur" />
      <page pageid="4307505" ns="0" title="Albert Robbins" />
      <page pageid="5539" ns="0" title="Anglais">
Comment 1 Alex Monk 2012-05-14 18:00:04 UTC
Confirmed, upped the severity of this bug.

According to OP, this problem seems to have started in the past few days - maybe it was introduced in 1.20wmf2?
Comment 2 Umherirrender 2012-05-14 18:38:42 UTC
Are POST Request also affected, where titles= is part of the POST data?
Comment 3 Brad Jorsch 2012-05-14 19:20:38 UTC
It appears that "very long" here corresponds to "512 or more bytes in the parameter". It does not seem to have anything to do with the number of titles, the total length of all parameters, or the number of characters in the URL-encoded representation.
Comment 4 NicoV 2012-05-14 19:50:14 UTC
The problem seems to be more general than the simple example I provided first.

For example, on frwiki, if I try to retrieve the list of templates in Hacker, it orks :
http://fr.wikipedia.org/w/api.php?action=query&prop=templates&format=xml&titles=Hacker

<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page pageid="1830125" ns="0" title="Hacker">
        <templates>
          <tl ns="10" title="Modèle:Autres projets" />
          <tl ns="10" title="Modèle:Homonymie" />
        </templates>
      </page>
    </pages>
  </query>
</api>

If I had a tltemplates argument with only "Modèle:Homonymie", it works also :
http://fr.wikipedia.org/w/api.php?action=query&prop=templates&format=xml&tltemplates=Mod%C3%A8le%3AHomonymie&titles=Hacker

<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page pageid="1830125" ns="0" title="Hacker">
        <templates>
          <tl ns="10" title="Modèle:Homonymie" />
        </templates>
      </page>
    </pages>
  </query>
</api>

But, when I put all disambiguation templates in the tltemplates argument, the result is wrong (Modèle:Homonymie is not returned as one of the templates in Hacker) :
/w/api.php?action=query&prop=templates&format=xml&tltemplates=Mod%C3%A8le%3AArrondissements%20homonymes%7CMod%C3%A8le%3ABandeau%20standard%20pour%20page%20d'homonymie%7CMod%C3%A8le%3ABatailles%20homonymes%7CMod%C3%A8le%3ACantons%20homonymes%7CMod%C3%A8le%3ACommunes%20fran%C3%A7aises%20homonymes%7CMod%C3%A8le%3AFilms%20homonymes%7CMod%C3%A8le%3AGouvernements%20homonymes%7CMod%C3%A8le%3AGuerres%20homonymes%7CMod%C3%A8le%3AHomonymie%7CMod%C3%A8le%3AHomonymie%20bateau%7CMod%C3%A8le%3AHomonymie%20d'%C3%A9tablissements%20scolaires%20ou%20universitaires%7CMod%C3%A8le%3AHomonymie%20d'%C3%AEles%7CMod%C3%A8le%3AHomonymie%20de%20clubs%20sportifs%7CMod%C3%A8le%3AHomonymie%20de%20comt%C3%A9s%7CMod%C3%A8le%3AHomonymie%20de%20monument%7CMod%C3%A8le%3AHomonymie%20de%20nom%20romain%7CMod%C3%A8le%3AHomonymie%20de%20parti%20politique%7CMod%C3%A8le%3AHomonymie%20de%20route%7CMod%C3%A8le%3AHomonymie%20dynastique%7CMod%C3%A8le%3AHomonymie%20vid%C3%A9oludique%7CMod%C3%A8le%3AHomonymie%20%C3%A9difice%20religieux%7CMod%C3%A8le%3AInternationalisation%7CMod%C3%A8le%3AIsom%C3%A9rie%7CMod%C3%A8le%3AParonymie%7CMod%C3%A8le%3APatronyme%7CMod%C3%A8le%3APatronyme%20basque%7CMod%C3%A8le%3APatronyme%20italien%7CMod%C3%A8le%3APatronymie%7CMod%C3%A8le%3APersonnes%20homonymes%7CMod%C3%A8le%3ASaints%20homonymes%7CMod%C3%A8le%3ATitres%20homonymes%7CMod%C3%A8le%3AToponymie%7CMod%C3%A8le%3AUnit%C3%A9s%20homonymes%7CMod%C3%A8le%3AVilles%20homonymes%7CMod%C3%A8le%3A%C3%89difices%20religieux%20homonymes&titles=Hacker

<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page pageid="1830125" ns="0" title="Hacker" />
    </pages>
  </query>
</api>
Comment 5 NicoV 2012-05-14 19:51:08 UTC
Link to last example was:
http://fr.wikipedia.org/w/api.php?action=query&prop=templates&format=xml&tltemplates=Mod%C3%A8le%3AArrondissements%20homonymes%7CMod%C3%A8le%3ABandeau%20standard%20pour%20page%20d'homonymie%7CMod%C3%A8le%3ABatailles%20homonymes%7CMod%C3%A8le%3ACantons%20homonymes%7CMod%C3%A8le%3ACommunes%20fran%C3%A7aises%20homonymes%7CMod%C3%A8le%3AFilms%20homonymes%7CMod%C3%A8le%3AGouvernements%20homonymes%7CMod%C3%A8le%3AGuerres%20homonymes%7CMod%C3%A8le%3AHomonymie%7CMod%C3%A8le%3AHomonymie%20bateau%7CMod%C3%A8le%3AHomonymie%20d'%C3%A9tablissements%20scolaires%20ou%20universitaires%7CMod%C3%A8le%3AHomonymie%20d'%C3%AEles%7CMod%C3%A8le%3AHomonymie%20de%20clubs%20sportifs%7CMod%C3%A8le%3AHomonymie%20de%20comt%C3%A9s%7CMod%C3%A8le%3AHomonymie%20de%20monument%7CMod%C3%A8le%3AHomonymie%20de%20nom%20romain%7CMod%C3%A8le%3AHomonymie%20de%20parti%20politique%7CMod%C3%A8le%3AHomonymie%20de%20route%7CMod%C3%A8le%3AHomonymie%20dynastique%7CMod%C3%A8le%3AHomonymie%20vid%C3%A9oludique%7CMod%C3%A8le%3AHomonymie%20%C3%A9difice%20religieux%7CMod%C3%A8le%3AInternationalisation%7CMod%C3%A8le%3AIsom%C3%A9rie%7CMod%C3%A8le%3AParonymie%7CMod%C3%A8le%3APatronyme%7CMod%C3%A8le%3APatronyme%20basque%7CMod%C3%A8le%3APatronyme%20italien%7CMod%C3%A8le%3APatronymie%7CMod%C3%A8le%3APersonnes%20homonymes%7CMod%C3%A8le%3ASaints%20homonymes%7CMod%C3%A8le%3ATitres%20homonymes%7CMod%C3%A8le%3AToponymie%7CMod%C3%A8le%3AUnit%C3%A9s%20homonymes%7CMod%C3%A8le%3AVilles%20homonymes%7CMod%C3%A8le%3A%C3%89difices%20religieux%20homonymes&titles=Hacker
Comment 6 Brad Jorsch 2012-05-14 20:05:20 UTC
Probably the same underlying problem: it's searching for "Modèle:Homonymie" instead of "Modèle:Homonymie". Trim the value of tltemplates down to 511 bytes or less (before URL-encoding) and it'll probably work. Or try it with a POST, as Umherirrender suggested.
Comment 7 NicoV 2012-05-14 20:21:09 UTC
I just tried with POST instead of GET : it works.
So the problem seems to appear only with GET requests.
Comment 8 Mark A. Hershberger 2012-05-15 12:57:10 UTC
Bumping to highest as this appears to be a regression with the recent rollout.  Notifying Robla and Sumana.
Comment 9 Alex Monk 2012-05-15 14:51:59 UTC
*** Bug 36799 has been marked as a duplicate of this bug. ***
Comment 10 Brad Jorsch 2012-05-15 16:40:04 UTC
I think I might have figured this out.

In a post on enwiki from May 11,[1] we are told that Roan changed the "PCRE recursion limit" from the default 100k to 1k. I assume this is referring to PHP's "pcre.recursion_limit" setting,[2] which indeed has a default of 100000.

One thing the recursion limit affects how often regexes with subexpressions like "(x)+" can match. It seems that each match by "+" there uses up 2 of the recursion limit; with a value of 1024, it can match at most 511 times. If it would match 512 times, preg_match will return false instead. You can test this easily enough if you have a recent-enough command-line PHP:

  php -r 'ini_set("pcre.recursion_limit", 1024); var_dump(preg_match("/(x)+/", str_repeat("x", 511)));'
  php -r 'ini_set("pcre.recursion_limit", 1024); var_dump(preg_match("/(x)+/", str_repeat("x", 512)));'

The first will succeed, while the second will fail. But if you bump the 1024 to 1026, the second will start working.

So what seems to be going on is this: The API uses the methods in WebRequest to get the parameters from the client, all of which seem to come down to getGPCVal. For any parameter that exists in $_GET (even if overridden by $_POST), getGPCVal passes the value through Language::checkTitleEncoding to make sure it's valid UTF-8. And due to the low recursion limit, the regex in Language::checkTitleEncoding that tries to check whether the value is valid UTF-8 will now think it is ''not'' valid if the value is more than 511 characters long, so it will treat it as the fallback 8-bit encoding (windows-1252 for most languages), which gives the familiar "è" mojibake.

If I'm right, the fix for this bug would be to revert Roan's change to the "pcre.recursion_limit" setting (and fix whatever PageTriage's problem is in some other way), or at least turn it up to something more reasonable than 1024. I'd expect this is causing problems in other areas of the code, too.

 [1]: https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:Database_reports&diff=491927371&oldid=491919743
 [2]: http://us.php.net/manual/en/pcre.configuration.php#ini.pcre.recursion-limit
Comment 11 Sumana Harihareswara 2012-05-15 18:33:25 UTC
Keywording i18n.
Comment 12 Umherirrender 2012-05-18 18:44:29 UTC
(In reply to comment #10)
> I'd expect this is causing problems in other areas of the code, too.

And it causing problems for many clients, which using titles= with 50 urlencoded titles. Please use a temp fix soon (reset the limit) and look into the real problem after that.

Thanks.
Comment 13 Lupo 2012-05-22 07:37:03 UTC
The regexp recursion limit aside, is using a regexp to check for UTF-8 appropriate? Why not use mb_check_encoding() if available? Other operations in Language.php do make use of the mb_* functions...

http://php.net/manual/en/function.mb-check-encoding.php
Comment 14 Umherirrender 2012-05-25 14:26:13 UTC
*** Bug 37021 has been marked as a duplicate of this bug. ***
Comment 15 Mark A. Hershberger 2012-05-26 17:47:22 UTC
(In reply to comment #10)
> If I'm right, the fix for this bug would be to revert Roan's change to the
> "pcre.recursion_limit" setting (and fix whatever PageTriage's problem is in
> some other way), or at least turn it up to something more reasonable than 1024.
> I'd expect this is causing problems in other areas of the code, too.

Sam, have you had a chance to look at this yet?
Comment 16 Sam Reed (reedy) 2012-05-27 12:26:18 UTC
(In reply to comment #15)
> (In reply to comment #10)
> > If I'm right, the fix for this bug would be to revert Roan's change to the
> > "pcre.recursion_limit" setting (and fix whatever PageTriage's problem is in
> > some other way), or at least turn it up to something more reasonable than 1024.
> > I'd expect this is causing problems in other areas of the code, too.
> 
> Sam, have you had a chance to look at this yet?

Nope.

We can increase the pcre.recursion_limit again, the value Roan set was overly conservative.

I'm not sure what the availability of various Ops staff is going to be to push this through

(In reply to comment #13)
> The regexp recursion limit aside, is using a regexp to check for UTF-8
> appropriate? Why not use mb_check_encoding() if available? Other operations in
> Language.php do make use of the mb_* functions...
> 
> http://php.net/manual/en/function.mb-check-encoding.php

I'm guessing legacy reasons, and the code was never updated.

Please feel free to submit a patch to BZ or a commit to Gerrit. Or is just changing:

		$isutf8 = preg_match( '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
				'[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/', $s );
		if ( $isutf8 ) {
			return $s;
		}

to

		if ( mb_check_encoding( $s, 'UTF-8' )  ) {
			return $s;
		}

enough?

Language::checkTitleEncoding has no unit tests written for it either.
Comment 17 Sam Reed (reedy) 2012-05-31 10:50:09 UTC
For anyone not following along... Lupo created https://gerrit.wikimedia.org/r/#/c/9126/ and added some unit tests
Comment 18 Sam Reed (reedy) 2012-05-31 12:49:03 UTC
Code is now live on Wikimedia wikis. Both the regex fix, and the addition of mb_check_encoding() are there. Should be quite a bit quicker. Shouldn't need the recursion limit increasing again either
Comment 19 Brad Jorsch 2012-05-31 14:59:17 UTC
(In reply to comment #18)
> Shouldn't need the recursion limit increasing again either

Are you sure that this low recursion limit isn't breaking things elsewhere in the code too?
Comment 20 Sam Reed (reedy) 2012-05-31 15:58:58 UTC
(In reply to comment #19)
> (In reply to comment #18)
> > Shouldn't need the recursion limit increasing again either
> 
> Are you sure that this low recursion limit isn't breaking things elsewhere in
> the code too?

Nope. But leaving it low will bring them up, and maybe we can fix those aswell? ;)
Comment 21 Gerrit Notification Bot 2013-04-09 15:59:27 UTC
Related URL: https://gerrit.wikimedia.org/r/58306 (Gerrit Change I27203c767d1d3f2f0999b1b1d8a06e8cf68c19ed)

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links