Last modified: 2014-08-25 21:17:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T70196, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 68196 - Lua error in an unprotected function can cause a longjmp to unwind HHVM's stack
Lua error in an unprotected function can cause a longjmp to unwind HHVM's stack
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Scribunto (Other open bugs)
unspecified
All All
: Unprioritized normal (vote)
: ---
Assigned To: Nobody - You can work on this!
: hhvm
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-07-17 22:19 UTC by Brett Simmers
Modified: 2014-08-25 21:17 UTC (History)
8 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
hhvm.256.io:~bsimmers/lua.log (812.79 KB, text/plain)
2014-07-17 22:25 UTC, Ori Livneh
Details
Reduced test case (541 bytes, text/x-php)
2014-07-24 00:07 UTC, Tim Starling
Details

Description Brett Simmers 2014-07-17 22:19:57 UTC
I'm not 100% sure this is a bug in the Lua extension but all signals are pointing at it.

Repro steps: Make sure the vps is running a Debug HHVM, then load http://hhvm.256.io/wiki/San_Francisco?action=purge. You'll get a 503 and HHVM will crash with: hhvm: .../hphp/runtime/vm/jit/unwind-x64.cpp:161: _Unwind_Reason_Code HPHP::JIT::{anonymous}::tc_unwind_personality(int, _Unwind_Action, uint64_t, _Unwind_Exception*, _Unwind_Context*): Assertion `status == 0' failed.

Commenting out Scribunto in LoadExtensions.php makes the crash go away, and the page renders with some broken templates as expected. I've done some tracing in HHVM and it looks like this could be related to an exception being thrown out of the extension. This log of unwinding activity right before the crash may point to some relevant functions: http://pastebin.com/BkNguanT
Comment 1 Brett Simmers 2014-07-17 22:21:54 UTC
~bsimmers/lua.log has a much more useful-looking stacktrace than what we were previously seeing.
Comment 2 Ori Livneh 2014-07-17 22:25:28 UTC
Created attachment 15964 [details]
hhvm.256.io:~bsimmers/lua.log
Comment 3 Tim Starling 2014-07-18 01:13:19 UTC
The backtrace suggests https://github.com/facebook/hhvm/pull/3121 , which I see you don't have the fix for in /home/bsimmers/hhvm-dbg . When I parse that page in CLI mode using /usr/local/bin/hhvm, I reliably get an assertion from JIT::checkFrame:

hhvm: /srv/hhvm-dev/hphp/runtime/base/ref-data.h:118: HPHP::Cell* HPHP::RefData::tv(): Assertion `m_magic == Magic::kMagic' failed.

#4  0x0000000000cd81a1 in tv (this=<optimized out>) at /srv/hhvm-dev/hphp/runtime/base/ref-data.h:118
#5  HPHP::tvIsPlausible (tv=...) at /srv/hhvm-dev/hphp/runtime/base/tv-helpers.cpp:82
#6  0x00000000014bc05b in HPHP::JIT::checkFrame (fp=0x7fffdf97ea70, sp=<optimized out>, checkLocals=checkLocals@entry=true)
    at /srv/hhvm-dev/hphp/runtime/vm/jit/translator-runtime.cpp:715
#7  0x00000000014bdcd7 in HPHP::JIT::traceCallback (fp=0x7fffdf97ea70, sp=0x7fffdf97e920, pcOff=10060, rip=0x3ea11c0)
    at /srv/hhvm-dev/hphp/runtime/vm/jit/translator-runtime.cpp:728
#8  0x0000000003ea11d9 in ?? ()
Comment 4 Brett Simmers 2014-07-18 01:18:32 UTC
I tried with that applied earlier today and it still crashed (I've been messing with the contents of that working dir recently). I guess I didn't notice that it was a different crash. How can I parse a page in CLI mode?
Comment 5 Ori Livneh 2014-07-18 01:23:09 UTC
(In reply to Brett Simmers from comment #4)
> How can I parse a page in CLI mode?

cd /srv/mediawiki
php maintenance/eval.php
> $title = Title::newFromText('San Francisco'); $text = Revision::newFromTitle($title)->getText(); $wgParser->parse($text, $title, new ParserOptions);
Comment 6 Tim Starling 2014-07-18 01:27:36 UTC
Yes, that. There are various other ways to do it, but that one lets you modify the input text before you run it, which is often useful for isolating parser-related bugs. However, I haven't found any reduced input text yet -- it seems to depend on execution history.
Comment 7 Tim Starling 2014-07-21 06:49:22 UTC
It is asserting while checking local number 7 of PPFrame_Hash::expand().
Comment 8 Tim Starling 2014-07-22 05:04:01 UTC
...that being $iteratorNode. The TypedValue* for $iteratorNode was consistent enough for me to set a watch on it, which showed it being overwritten with garbage by translator-asm-helpers.S, probably the stack write on line 52. That seemed pretty JIT-specific, so I tried running it without the JIT, and it instead asserted in  the ~ScopeGuardImpl of ExecutionContext::invokeFunc(), when invoking an unnamed function (presumably the eval.php command line).

Before the write by translator-asm-helpers.S, $iteratorNode had a PPNode_Hash_Array in it, as expected. There's nothing obviously linking this to Scribunto, at present.
Comment 9 Brett Simmers 2014-07-23 05:30:37 UTC
After some fun times with gdb, I'm almost certain this is luasandbox related, though I'm not yet sure what the root cause is. I was seeing what looked like some C++ frames disappearing without the relevant destructors being called, so I suspected something was going wrong with a longjmp somewhere and went with that.

http://pastebin.com/mKeqJYLZ is a stacktrace I took right before everything went off the rails. Notice that ExecutionContext::invokeFunc and luasandbox_call_helper both appear twice, indicating that PHP called into lua, which called back into PHP, which is now calling into lua again (this is expected behavior, right?). I previously stepped through one longjmp that didn't escape the call to luasandbox_call_helper in frame #5, but the one about to happen doesn't go so well. I stepped through this longjmp up until it restored the previous stack and took another backtrace: http://pastebin.com/6UgHygPs. Notice that it appears to have jumped from the nested invocation of lua to the outer one, skipping over all the hhvm frames in between the two. This explains how we eventually end up in the outmost invokeFunc frame with the VM state still looking like the nested frame: the nested VM state is supposed to be popped by a destructor in invokeFunc that never ran.

I'm done for today but the best theory I have so far is that there's some global setjmp/longjmp buffer in liblua that's being used improperly. Is liblua supposed to handle reentrancy like this? I noticed that lua_pushcclosure() is in the backtrace for the problematic longjmp but not for the previous one; maybe that's related?
Comment 10 Brett Simmers 2014-07-23 05:40:26 UTC
I don't know anything about the internals of the Zend engine. Is it possible that this is happening with PHP5 as well but everything happens to work out anyway? I'll try it tomorrow but it's possible that in a release build of HHVM without asserts on, things could mostly appear to work.
Comment 11 Tim Starling 2014-07-23 06:06:59 UTC
Nice work.  A Lua error in an unprotected function like lua_pushcclosure() could indeed cause a longjmp to unwind HHVM's stack. If Lua was not in the stack when this happened, it would call the panic hook (luasandbox_panic) and we would get an informative error message, like bug 59130. But it is in the stack, so it tries to longjmp back to the last lua_pcall().
Comment 12 Tim Starling 2014-07-23 07:06:01 UTC
Also, because Lua is in the stack, the in_lua flag is set, so "slop" is zero. That is to say, the hack intended to fix bug 59130 is disabled. The Lua userspace takes the usage to within 36 bytes of the limit, and lua_pushcclosure() tries to allocate 40, so after the error is handled, it's assured that re-entering the same state will cause an unprotected OOM.

in_lua and in_php are both flags indicating the contents of the stack, not the immediate caller. Maybe we need a third flag which indicates that the immediate caller is unprotected. 

Also, I think the 1 MB slop might be a bit too conservative. Since the result of an OOM is a crash, we should probably just disable the memory limit entirely for unprotected calls.

A third thing we should do is to compile our own Lua library in C++. This might not be convenient for all reusers, but it causes Lua to use exceptions for error propagation, instead of setjmp/longjmp, which should be safer.

#if defined(__cplusplus)
/* C++ exceptions */
#define LUAI_THROW(L,c)	throw(c)
#define LUAI_TRY(L,c,a)	try { a } catch(...) \
	{ if ((c)->status == 0) (c)->status = -1; }

Note that it does catch and discard all exceptions, even ones from HHVM -- I noticed this during design work for https://github.com/facebook/hhvm/pull/1986 . The Lua VM is not exception-safe, so it has a choice between discarding exceptions or crashing. That's one of the reasons why EZC catches and saves exceptions: to prevent them from breaking Lua. If we were integrating Lua as a non-EZC HHVM extension, we would still need that catch and save layer, equivalent to PR 1986.
Comment 13 Tim Starling 2014-07-24 00:07:53 UTC
Created attachment 16028 [details]
Reduced test case

I used a recursive table "a = {a,0}" to force Lua to do the small allocations necessary to reliably take the memory usage up to within a short distance of the limit. Then by having PHP attempt to re-enter Lua after the resulting OOM, a segfault reliably results.
Comment 14 Tim Starling 2014-07-24 03:25:07 UTC
On hhvm.256.io, I have installed a LuaSandbox statically linked against Lua compiled as C++. With this build, the reduced test case exits cleanly with "Fatal error: unknown exception". That's quite a nice failure mode, and deals with all Lua errors thrown from unprotected functions, not just OOMs.
Comment 15 Giuseppe Lavagetto 2014-07-24 15:23:19 UTC
(In reply to Tim Starling from comment #14)
> On hhvm.256.io, I have installed a LuaSandbox statically linked against Lua
> compiled as C++. With this build, the reduced test case exits cleanly with
> "Fatal error: unknown exception". That's quite a nice failure mode, and
> deals with all Lua errors thrown from unprotected functions, not just OOMs.

https://gerrit.wikimedia.org/r/#/c/149001 did the trick for the package, the last package version, hhvm-luasandbox 2.0-3 links against the c++ version of the library.

It is being tested now, AFAIK
Comment 16 Bryan Davis 2014-07-24 16:02:56 UTC
Cross post from https://bugzilla.wikimedia.org/show_bug.cgi?id=68413#c8 :

Updated beta servers to hhvm-luasandbox 2.0-3 build, changed config back to luasandbox and restarted hhvm fcgi container. Still seeing crashes:

Host: deployment-mediawiki01
ProcessID: 27248
ThreadID: 7fad89bff700
ThreadPID: 27750
Name: unknown program
Type: Segmentation fault
Runtime: hhvm
Version: heads/wikimedia-0-g8b842db4e2db664a9b4d543047ae154a6dd59de6
DebuggerCount: 0


# 0  virtual thunk to boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::program_options::invalid_option_value> >::rethrow() const at /usr/bin/hhvm:0
# 1  lua_sethook at /usr/lib/x86_64-linux-gnu/liblua5.1-c++.so.0:0
# 2  HPHP::Extension::moduleInfo(HPHP::Array&) at /usr/lib/hphp/extensions/20140702/luasandbox.so:0
# 3  timer_sigev_thread at /build/buildd/eglibc-2.19/rt/../nptl/sysdeps/unix/sysv/linux/timer_routines.c:66
# 4  start_thread at /build/buildd/eglibc-2.19/nptl/pthread_create.c:312
# 5  clone at /build/buildd/eglibc-2.19/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:113

$ hhvm --version
HipHop VM 3.3.0-dev (rel)
Compiler: heads/wikimedia-0-g8b842db4e2db664a9b4d543047ae154a6dd59de6
Repo schema: ce469da81c1d8ec23f3a4aa889afadad8df5a759

$ dpkg -l|grep ^ii|awk '{printf "%-20s %s\n", $2, $3}'|grep hhvm
hhvm                 3.1+20140723-1+wmf1
hhvm-dev             3.1+20140723-1+wmf1
hhvm-fss             1.1-2
hhvm-luasandbox      2.0-3
hhvm-wikidiff2       1.3-2
Comment 17 Ori Livneh 2014-07-25 00:43:54 UTC
(In reply to Bryan Davis from comment #16)
> Cross post from https://bugzilla.wikimedia.org/show_bug.cgi?id=68413#c8 :
> 
> Updated beta servers to hhvm-luasandbox 2.0-3 build, changed config back to
> luasandbox and restarted hhvm fcgi container. Still seeing crashes:

That's bug 68413 (as you note).

(In reply to Tim Starling from comment #14)
> On hhvm.256.io, I have installed a LuaSandbox statically linked against Lua
> compiled as C++. With this build, the reduced test case exits cleanly with
> "Fatal error: unknown exception". That's quite a nice failure mode, and
> deals with all Lua errors thrown from unprotected functions, not just OOMs.

Confirmed that the updated package behaves correctly too.
Comment 18 Brad Jorsch 2014-08-25 21:17:59 UTC
So if I understand the part Ori was asking me about today, we're still wanting to look at the memory limit handling as Tim talked about in comment 12? If the below sounds fine to you, Tim, I can have code in Gerrit tomorrow.


(In reply to Tim Starling from comment #12)
> Also, because Lua is in the stack, the in_lua flag is set, so "slop" is
> zero. That is to say, the hack intended to fix bug 59130 is disabled. The
> Lua userspace takes the usage to within 36 bytes of the limit, and
> lua_pushcclosure() tries to allocate 40, so after the error is handled, it's
> assured that re-entering the same state will cause an unprotected OOM.
> 
> in_lua and in_php are both flags indicating the contents of the stack, not
> the immediate caller. Maybe we need a third flag which indicates that the
> immediate caller is unprotected. 

The fix for bug 59130 was just checking for whether the call was protected at some level, which in_lua indicates. But it sounds like the problem here is that we can't have Lua error handling jumping back to an outer level of recursion, so we need to consider protectedness at each level instead of just globally.

I don't think we need a third flag: we start out unprotected (with in_lua=0 and in_php=0). When we enter protected mode, that's luasandbox_call_helper incrementing in_lua and doing a lua_pcall. A recursive call from Lua back to PHP is going to go through luasandbox_call_php, which increments in_php, and then calling back into Lua will increment in_lua again. So it should be the case that in_lua == in_php+1 whenever we're protected and in_lua == in_php whenever we're not.

If we were to add a "protected" flag, we'd basically replicate that logic: clear at start, save-and-set it before the pcall then restore it after in luasandbox-call_helper, and save-and-clear it before calling into PHP and then restore it after in luasandbox_call_php.

I put in asserts testing the relationship between in_lua and in_php, and both the luasandbox tests and the Scribunto tests all passed locally (on Zend, but that shouldn't make a difference) without tripping them.

> Also, I think the 1 MB slop might be a bit too conservative. Since the
> result of an OOM is a crash, we should probably just disable the memory
> limit entirely for unprotected calls.

I chose 1MB because it seemed like it would be reasonable for what the unprotected calls were doing, while still having some limitation in case someone figured out some way to make an unprotected call use large amounts of memory. But if that assumption is wrong, it's easy enough to disable the limit entirely. Would we still keep the prevention of a single allocation greater than the memory limit, or disable that one too?

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links