Last modified: 2014-09-15 01:15:37 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T72177, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 70177 - The way LuaSandbox measures CPU time is not thread-safe
The way LuaSandbox measures CPU time is not thread-safe
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Scribunto (Other open bugs)
unspecified
All All
: High normal (vote)
: ---
Assigned To: Tim Starling
: hhvm
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-08-29 14:34 UTC by Jackmcbarn
Modified: 2014-09-15 01:15 UTC (History)
9 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
A naive fix that doesn't work (2.62 KB, patch)
2014-08-31 05:23 UTC, Ori Livneh
Details
Small helper script for reproducing the bug (1.05 KB, text/plain)
2014-09-01 02:11 UTC, Ori Livneh
Details

Description Jackmcbarn 2014-08-29 14:34:49 UTC
This is from an HHVM jobrunner:
<!-- 
NewPP limit report
Parsed by mw1053
CPU time usage: 41.412 seconds
Real time usage: 3.443 seconds
Preprocessor visited node count: 13901/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 316970/2048000 bytes
Template argument size: 35984/2048000 bytes
Highest expansion depth: 23/40
Expensive parser function count: 16/500
Lua time usage: 10.352/10.000 seconds
Lua memory usage: 2.54 MB/50 MB
-->

<!-- Saved in parser cache with key enwiki:pcache:idhash:13146361-0!*!0!*!*!4!* and timestamp 20140829142304 and revision id 269146468
 -->

This is the same page, also HHVM, but from a null-edit rather than a job:
<!-- 
NewPP limit report
Parsed by mw1019
CPU time usage: 0.986 seconds
Real time usage: 1.276 seconds
Preprocessor visited node count: 13994/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 319685/2048000 bytes
Template argument size: 35683/2048000 bytes
Highest expansion depth: 23/40
Expensive parser function count: 16/500
Lua time usage: 0.290/10.000 seconds
Lua memory usage: 2.59 MB/50 MB
-->

<!-- Saved in parser cache with key enwiki:pcache:idhash:13146361-0!*!0!*!*!4!* and timestamp 20140829142922 and revision id 269146468
 -->

The job runner's CPU time is much higher for no apparent reason. This has been going on for several days/weeks.
Comment 1 Ori Livneh 2014-08-30 06:39:33 UTC
The issue is that LuaSandbox is not thread-safe. The figure you're seeing conflates the CPU time of several concurrent threads of execution.

* LuaSandbox uses CLOCK_PROCESS_CPUTIME_ID to measure time, but CLOCK_PROCESS_CPUTIME_ID measures CPU time consumed by all threads of the calling process.

* LuaSandbox requests to be notified with a SIGEV_SIGNAL upon timer expiration, but at any point in time, only only one such signal is queued per timer per process.  

* LuaSandbox uses the sigprocmask() system call to block signals, but its behavior in a multithreaded process is unspecified.  pthread_sigmask() must be used instead.

* When the normal or emergency timers are stopped, any queued SIGEV_SIGNAL for the entire process is flushed. 

* When multiple threads are executing concurrently, the timer for the entire process is repeatedly re-armed (i.e., the exiration is reset).

* Confusingly, LUASANDBOX_G(signal_handler_installed) is thread-local, not global. This means every thread calls luasandbox_timer_install_handler.  The latter caches the old handler in another thread-local variable and reinstalls it when LuaSandbox is deactivating. When LuaSandbox is active in multiple threads, individual threads end up re-installing each other's signal handlers, keeping each other alive.
  
The above applies specifically to the normal and emergency timers. The profiling feature has its own bag of thread-safety / reentrancy issues, which I propose we deal with in the context of bug 68413 rather than this bug.
Comment 2 Jackmcbarn 2014-08-30 13:57:53 UTC
I notice that it's not just Lua time usage that's bad; core's "CPU time usage" is also really high. Is that problem related to this one?
Comment 3 Ori Livneh 2014-08-31 01:51:08 UTC
(In reply to Jackmcbarn from comment #2)
> I notice that it's not just Lua time usage that's bad; core's "CPU time
> usage" is also really high. Is that problem related to this one?

Yes. The fact that concurrent threads reinstall each other's signal handlers means that requests can make each other linger, increasing the backend response time.

Your question prompted me to take a look at ParserOutput::getTimes. I found that it also has a separate bug which, like this one, stems from the assumptions it makes about the underlying runtime's threading model. Filed as bug 70227.
Comment 4 Jackmcbarn 2014-08-31 02:01:12 UTC
Removing performance keyword, since this is a bug in the timer rather than anything actually being slow.
Comment 5 Victor Vasiliev 2014-08-31 02:03:27 UTC
So, is there more to this bug than replacing CLOCK_PROCESS_CPUTIME_ID with CLOCK_THREAD_CPUTIME_ID?
Comment 6 Ori Livneh 2014-08-31 05:23:25 UTC
Created attachment 16326 [details]
A naive fix that doesn't work

(In reply to Victor Vasiliev from comment #5)
> So, is there more to this bug than replacing CLOCK_PROCESS_CPUTIME_ID with
> CLOCK_THREAD_CPUTIME_ID?

Yes. See the attached patch, which doesn't work. (It segfaults in luasandbox_timer_set_one_time). There are things which should be thread-local but aren't. It's going to take some very careful work to fix this.
Comment 7 Ori Livneh 2014-08-31 05:38:58 UTC
Two ideas:

* In a multithreaded environment, it might be a lot saner to use the timerfd_*[1] family of system calls, which operate on timers that notify via file descriptors rather than signals. But this is a Linux-specific feature, so it would come at the cost of portability.

* In an ideal world, LuaSandbox would use high-level timing and code profiling interfaces provided by HHVM, such as Timer::GetRusageMicros[2], cpuCyles[3], etc.

[1]: http://man7.org/linux/man-pages/man2/timerfd_create.2.html
[2]: https://github.com/facebook/hhvm/blob/master/hphp/util/timer.cpp#L108
[3]: https://github.com/facebook/hhvm/blob/master/hphp/util/cycles.h#L29
Comment 8 Ori Livneh 2014-09-01 02:11:54 UTC
Created attachment 16332 [details]
Small helper script for reproducing the bug

Fires off multiple requests to the enwiki API to parse [[en:Barack Obama]]. Only usable from the WMF cluster.
Comment 9 Gerrit Notification Bot 2014-09-08 04:07:16 UTC
Change 159011 had a related patch set uploaded by Tim Starling:
[WIP] Make timer/profiler be thread-safe

https://gerrit.wikimedia.org/r/159011
Comment 10 Gerrit Notification Bot 2014-09-11 07:00:47 UTC
Change 159011 merged by jenkins-bot:
Make timer/profiler be thread-safe

https://gerrit.wikimedia.org/r/159011

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links