Last modified: 2014-09-13 17:08:06 UTC
I think I found two intertwined "bugs":

== Wikipedia accepts invalid URIs in HTTP requests ==

According to the URI RFCs, such as 2396 and 3986: "A URI is a sequence of characters from a very limited set, i.e. the letters of the basic Latin alphabet, digits, and a few special characters."

I'm aware of URI variants like IRIs (RFC 3987), which allow a much wider range of Unicode characters, BUT the HTTP RFC specifies that HTTP accepts URIs, not IRIs. This does NOT render IRIs useless: we can still use IRIs in browsers, whose role is to convert them to valid URIs (with the knowledge of the local encoding).

So I would expect this request to fail, typically with a 400 Bad Request, instead of returning a 200 OK:

$ curl -si http://ar.wikipedia.org/wiki/حب | grep 'canonical\|HTTP/1.1'
HTTP/1.1 200 OK
<link rel="canonical" href="http://ar.wikipedia.org/wiki/حب" />

But I think that if Wikipedia returns a 200, there may be a reason, and this ticket is a good opportunity to document it.

== Due to the previous bug, Wikipedia serves the same page behind two different URIs with two different rel-canonical values ==

$ urlencode 'حب'
%D8%AD%D8%A8
$ curl -si http://ar.wikipedia.org/wiki/%D8%AD%D8%A8 | grep 'canonical\|HTTP/1.1'
HTTP/1.1 200 OK
<link rel="canonical" href="http://ar.wikipedia.org/wiki/%D8%AD%D8%A8" />

I think this one is clearly not normal: if no 400 is returned, the rel canonical for the invalid URI should be set to the encoded (valid) form.
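
To illustrate the conversion I have in mind, here is a minimal Python sketch of how a client (or the server, when building the rel canonical) could map the IRI form to the encoded URI form. The function name `iri_to_uri` is hypothetical, and the sketch assumes UTF-8 and an ASCII host name; it is not a description of what MediaWiki or any browser actually does.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Percent-encode the non-ASCII parts of an IRI's path and query (UTF-8).

    Assumption: the host is already ASCII (no IDNA handling here).
    '%' is kept as a safe character so that an already-encoded URI
    is returned unchanged instead of being double-encoded.
    """
    parts = urlsplit(iri)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(iri_to_uri("http://ar.wikipedia.org/wiki/حب"))
```

With such a mapping, both requests above would end up with the same canonical value, since the encoded form maps to itself.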