Last modified: 2014-08-27 23:12:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T63102, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 61102 - Soften qdel behaviour from KILL
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: Unprioritized enhancement
Target Milestone: ---
Assigned To: Marc A. Pelletier
Duplicates: 63878
Depends on:
Blocks:
Reported: 2014-02-09 07:34 UTC by Tim Landscheidt
Modified: 2014-08-27 23:12 UTC

Description Tim Landscheidt 2014-02-09 07:34:47 UTC
At the moment, qdel KILLs the job; this is a bit rude.

If jsub called "qsub -notify", SGE would send the job a warning signal before KILLing it.

The signal is set by NOTIFY_KILL in "execd_params"; the default is SIGUSR2. I would favour SIGTERM (or a SIGHUP -> SIGINT -> SIGTERM cascade), as I suppose more programs will already have a suitable handler for that.
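
As an illustration of what "a suitable handler" might look like on the job side, here is a minimal Python sketch. The class GracefulJob and its methods are hypothetical and not part of jsub or SGE; it assumes the notification signal is SIGTERM, as proposed above, and that the job does its work in small steps.

```python
import signal


class GracefulJob:
    """Hypothetical job wrapper that stops cleanly when notified."""

    def __init__(self, signals=(signal.SIGTERM,)):
        self.stop_requested = False
        self.received = None
        # Assumption: the grid sends one of these before the final SIGKILL.
        for sig in signals:
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame):
        # Only record the request; the main loop does the actual cleanup.
        self.stop_requested = True
        self.received = signum

    def run(self, work_step, cleanup, max_steps):
        """Call work_step until notified or max_steps is reached, then clean up."""
        steps = 0
        while not self.stop_requested and steps < max_steps:
            work_step()
            steps += 1
        cleanup()
        return steps
```

A job would pass its own unit of work as work_step and close database or network connections in cleanup. If the grid were configured with a different notification signal, that signal could be passed via the signals argument instead.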

The queue parameter "notify" defines the interval between signals. Given that many jobs in Tools use database and other network connections, I would be fairly generous here and propose 60 s. In the worst case of a SIGHUP -> SIGINT -> SIGTERM -> SIGKILL cascade, that means 180 s until the job is killed, which I find acceptable; for special cases, roots can always log into the exec node and kill at will.
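
The cascade and its worst-case duration can be sketched in Python as follows. This is only an illustration of the proposed escalation, not how SGE implements "notify"; the names CASCADE, worst_case_seconds and stop_with_cascade are hypothetical.

```python
import signal
import subprocess
import sys

# The proposed escalation order; SIGKILL follows if none of these works.
CASCADE = (signal.SIGHUP, signal.SIGINT, signal.SIGTERM)


def worst_case_seconds(interval, cascade=CASCADE):
    """Time until SIGKILL if the process ignores every signal in the cascade."""
    return interval * len(cascade)


def stop_with_cascade(proc, interval, cascade=CASCADE):
    """Send each signal in turn, waiting up to `interval` seconds after each.

    Returns the last signal that was sent to the subprocess.Popen object.
    """
    for sig in cascade:
        proc.send_signal(sig)
        try:
            proc.wait(timeout=interval)
            return sig
        except subprocess.TimeoutExpired:
            continue  # still running, escalate to the next signal
    proc.kill()  # SIGKILL cannot be caught or ignored
    proc.wait()
    return signal.SIGKILL
```

With an interval of 60 s, worst_case_seconds(60) gives the 180 s mentioned above: three signals, each followed by a full waiting period, before SIGKILL is sent.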
Comment 1 Tim Landscheidt 2014-04-24 13:47:29 UTC
*** Bug 63878 has been marked as a duplicate of this bug. ***
Comment 2 Tim Landscheidt 2014-04-24 13:48:24 UTC
We need to use "qsub -notify" in webservice as well.
Comment 3 metatron 2014-04-26 14:57:18 UTC
Concerning (non) termination of php-cgi processes:

http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ModFastCGI

There is an option "kill-signal" in .lighttpd fcgi settings.

"kill-signal": By default lighttpd send SIGTERM to FastCGI processes, which were spawned by lighttpd. Applications, which link libfcgi, need to be killed with SIGUSR1. This applies to php <5.2.1, lua-magnet and others.

I tried setting this value to 9 and also to 1. In neither case was the signal forwarded to the spawned CGI processes, while killing them by hand with signals 9 and 1 worked.

This (mis)behaviour also seems to matter for overloaded and dying webservices, as overloaded threads/processes are /not/ terminated as they should be.
Comment 4 Tim Landscheidt 2014-04-26 15:34:30 UTC
The program flow is different at the moment: on qdel, SGE kills the master lighttpd process with SIGKILL. SIGKILL cannot be caught, so lighttpd never has a chance to kill the php-cgi processes, and kill-signal is irrelevant at the moment.
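
To illustrate the difference, here is a minimal Python sketch of a parent process that passes a termination request on to the children it spawned. This works for SIGTERM because the parent can install a handler; no such handler can be installed for SIGKILL. The Supervisor class is hypothetical and does not reflect how lighttpd is implemented.

```python
import signal
import subprocess
import sys


class Supervisor:
    """Hypothetical parent that stops its children when it receives SIGTERM."""

    def __init__(self, commands):
        self.terminated = False
        self.children = [subprocess.Popen(cmd) for cmd in commands]
        # A handler like this is impossible for SIGKILL, which cannot be caught.
        signal.signal(signal.SIGTERM, self._on_term)

    def _on_term(self, signum, frame):
        self.shutdown()

    def shutdown(self, timeout=5):
        # Ask every child that is still running to terminate.
        for child in self.children:
            if child.poll() is None:
                child.terminate()
        # Give each child `timeout` seconds, then fall back to SIGKILL.
        for child in self.children:
            try:
                child.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                child.kill()
                child.wait()
        self.terminated = True
```

If the parent is killed with SIGKILL instead, neither _on_term nor shutdown ever runs, and the children are left behind as described above.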
Comment 5 Marc A. Pelletier 2014-08-27 23:12:04 UTC
The grid has been adjusted to use SIGTERM by default now; this problem should be solved.
