Last modified: 2014-08-29 22:34:39 UTC
Mediawiki Version 1.23.2 Scribunto Revision b4b099d Lua 5.1.4 Using mw.html.create(...) to make a mw.html object, I am able to use most of the functions just fine. However, when I use the following code in a module from the example in the Lua Reference Manual... >local div = mw.html.create( 'div' ) >div > :attr( 'id', 'testdiv' ) > :css( 'width', '100%' ) > :wikitext( "Hello, world! I am a \"" .. (args["type"] or "...nothing...") .. "\"" ) > :tag( 'hr' ) ...I get the following error: "Lua error in mw.html.lua at line 113: attempt to concatenate a nil value." With some exploration, I discovered the error is produced by this line (edited to stand alone): > div:css( 'width', '100%' ) I also receive the same error with a table input: > div:css( {['width'] = '100%'} ) ...and with numbers: > div:css( 'width', 100 )
I cannot reproduce this locally. Can you dig into mw.html.lua and try to determine what exactly is going wrong?
I am able to reproduce this. I'm investigating it now.
I tested this on Windows, since it appears that's where the reporter is, since the only place we provide 5.1.4 is Windows. Of particular note is the following: Warning: preg_replace_callback(): Compilation failed: invalid UTF-8 string at offset 14 in C:\xampp\htdocs\core\extensions\Scribunto\engines\LuaCommon\UstringLibrary.php on line 643 Backporting https://gerrit.wikimedia.org/r/#/c/145414/ fixes the issue for me (and makes that warning go away).
After further investigation, the problem seems to lie in the patternToRegex function. A simple test: The pattern "[
(and it seems Bugzilla doesn't like that character either. Where I say U+10FFFF below, I really mean that literal character.) After further investigation, the problem seems to lie in the patternToRegex function. A simple test: The pattern "[U+10FFFF]" is incorrectly converted to "/\[U+10FFFF\]/us" instead of the correct "/[U+10FFFF]/us". (I'm not sure why this is happening, or why that makes PHP throw a warning.)
It seems clear that PHP-on-Windows is doing something screwed up there; I see no code path in engines/LuaCommon/UstringLibrary.php that should ever convert "[U+10FFFF]" to a regex with backslashed brackets like that. If I had access to a machine where this was occurring, I'd be hacking patternToRegex() (in a local copy) to log various variables at various points in the function to see where exactly things are going wrong. Starting with whether $pattern is correct, and then whether $pat is getting set correctly. BTW, the issue with bugzilla cutting off the comment appears to be https://bugzilla.mozilla.org/show_bug.cgi?id=405011 (which is marked fixed, but apparently only in their 5.0 branch?).
Confirmed, this was reproduced on a Windows machine.
The difference happens when preg_split is called. var_dump( preg_split( '//us', "[\xF4\x8F\xBF\xBF]", null, PREG_SPLIT_NO_EMPTY ) ); On Linux: array(3) { [0] => string(1) "[" [1] => string(4) "U+10FFFF" [2] => string(1) "]" } On Windows: array(1) { [0]=> string(6) "[U+10FFFF]" }
That happens because U+10FFFF is considered an invalid character on Windows but not Linux. preg_match("/\xF4\x8F\xBF\xBF/u", "foobar"); produces "Warning: preg_match(): Compilation failed: invalid UTF-8 string at offset 0 in - on line 2" on Windows but not Linux.
Thanks for tracking that down. At this point it doesn't seem to be a bug Scribunto, since U+10FFFF is a valid Unicode code point. Further questions: * Does it affect other characters (e.g. U+10FFFE)? Ideally test every character, if that wouldn't take insanely long. * Is the Windows environment is using an old version of PCRE? Would upgrading fix it?
Linux and Windows both say that U+D800 through U+DFFF are invalid. Linux doesn't complain about anything else, but Windows also complains about U+FDD0 through U+FDEF, as well as all codepoints that end in FFFE or FFFF.
Windows PCRE version: 8.32 2012-11-30 Linux PCRE version: 8.31 2012-07-06
(In reply to Jackmcbarn from comment #11) > Linux and Windows both say that U+D800 through U+DFFF are invalid. That's not terribly surprising. UTF-16 surrogates aren't needed in UTF-8. > but Windows also complains about > U+FDD0 through U+FDEF, as well as all codepoints that end in FFFE or FFFF. So it's complaining about non-characters even though they're valid code points. (In reply to Jackmcbarn from comment #12) > Windows PCRE version: 8.32 2012-11-30 > Linux PCRE version: 8.31 2012-07-06 Looking through the PCRE changelog, I see the following for 8.33: > 21. Unicode validation has been updated in the light of Unicode Corrigendum #9, > which points out that "non characters" are not "characters that may not > appear in Unicode strings" but rather "characters that are reserved for > internal use and have only local meaning". This leads me to http://bugs.exim.org/show_bug.cgi?id=1340, which confirms that this was a change in 8.32 that was fixed in 8.33. About the only thing we could do in Scribunto would be to blacklist PCRE 8.32, throwing an exception if that version is detected. I don't know that we really want to go to that extent. I did update the documentation at https://www.mediawiki.org/wiki/Extension:Scribunto#PCRE_version_compatibility to note this issue.
Since this is a bug, but not our bug, changing to INVALID. As a workaround for this bug, you can cherry-pick https://gerrit.wikimedia.org/r/#/c/145414/ Running the command "git cherry-pick ee289c8045ea1cc260c56d447af3078d5f6075cd" from the Scribunto directory should do this, assuming it was checked out of git. Upgrading to 1.24 will include that change, so once you eventually do that, you won't have to worry about this particular instance of it anymore.