Last modified: 2014-04-24 13:47:29 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T65878, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 63878 - "webservice stop" doesn't stop php-cgi processes
Status: RESOLVED DUPLICATE of bug 61102
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Marc A. Pelletier
Duplicates: 64095
Depends on:
Blocks:
Reported: 2014-04-13 14:26 UTC by Tim Landscheidt
Modified: 2014-04-24 13:47 UTC (History)

Description Tim Landscheidt 2014-04-13 14:26:03 UTC
"webservice stop" stops the lighttpd processes, but not the php-cgi processes.  If lighttpd is restarted on a different webgrid host, these are simply left behind on the old host as orphaned processes.  More importantly, if lighttpd is restarted on the same webgrid host, they probably retain their original environment, so configuration changes might not affect them.
Comment 1 Tim Landscheidt 2014-04-22 19:36:44 UTC
*** Bug 64095 has been marked as a duplicate of this bug. ***
Comment 2 Tim Landscheidt 2014-04-24 02:49:35 UTC
Idea: Replace "qdel -j $job" with "ssh $WEBGRIDHOST 'kill -TERM $(cat /var/run/lighttpd/wikilint.pid)'".  This makes lighttpd shut down in an orderly fashion, taking the php-cgi processes with it (and even offers a model for graceful shutdowns with "kill -INT").
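
A minimal sketch of the orderly shutdown described above, assuming it runs on the webgrid host itself (for example as the remote command of the ssh call). The function name stop_by_pidfile, the 10-second default timeout and the return codes are illustrative assumptions, not part of the existing webservice script:

```shell
# Sketch only: signal the process named in a pid file and wait for it to exit.
# stop_by_pidfile is a hypothetical helper; variables are global because
# POSIX sh has no "local".
# Usage: stop_by_pidfile PIDFILE [SIGNAL] [TIMEOUT_SECONDS]
stop_by_pidfile() {
    pidfile=$1
    sig=${2:-TERM}
    timeout=${3:-10}

    # No readable pid file: nothing we can signal.
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")

    # Send the requested signal; fail if the process does not exist.
    kill -s "$sig" "$pid" 2>/dev/null || return 1

    # Poll once per second until the process is gone or the timeout expires.
    i=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$i" -ge "$timeout" ]; then
            return 2    # still running after TIMEOUT_SECONDS
        fi
        sleep 1
        i=$((i + 1))
    done
    return 0
}
```

With the pid file from the command above, this would be called as "stop_by_pidfile /var/run/lighttpd/wikilint.pid TERM", or with INT instead of TERM for the graceful variant. A caller could fall back to "qdel" when the function returns 2.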

I asked on users@gridengine.org (cf. http://permalink.gmane.org/gmane.comp.clustering.opengridengine.user/7487) how to find out which host a job is running on, but didn't get an "easy" answer (yet).  This works:

| qstat -xml | xmllint --xpath "substring-after(/job_info/queue_info/job_list[@state = 'running' and JB_name = 'lighttpd-wikilint']/queue_name/text(), '@')" -
Comment 3 Tim Landscheidt 2014-04-24 13:47:29 UTC
The problem with this approach is that it depends on "webservice stop" being the only way a job is ever killed.

If, for example, the grid moved the job to another host, it would still just use SIGKILL, and we would be back at square one.

So the sensible solution is to use "qsub -notify" and a suitable set of signals and timeouts.
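
As an illustration of what that could look like: for a job submitted with "qsub -notify", Grid Engine sends SIGUSR2 some time before the SIGKILL (and SIGUSR1 before a SIGSTOP), with the delay configured per queue. A job script can trap that warning and shut its server down cleanly. The wrapper below is a sketch only; the function name run_with_notify is hypothetical, and it assumes the server stays in the foreground and exits on SIGTERM:

```shell
# Sketch: run a server inside a Grid Engine job and turn the SIGUSR2 warning
# sent by "qsub -notify" into an orderly SIGTERM for the server.
# run_with_notify is a hypothetical helper name.
# Usage: run_with_notify COMMAND [ARGS...]
run_with_notify() {
    notified=0
    child=

    # On the warning signal, remember it and pass SIGTERM on to the child.
    trap 'notified=1; kill -s TERM "$child" 2>/dev/null' USR2

    "$@" &
    child=$!

    # wait returns early (status > 128) when a trapped signal arrives.
    wait "$child"
    status=$?

    if [ "$notified" -eq 1 ]; then
        # The first wait was interrupted; wait again for the real exit status.
        wait "$child"
        status=$?
    fi

    trap - USR2
    return "$status"
}
```

In a job script this would be the last command, for example "run_with_notify lighttpd -D -f $conf", where $conf is a placeholder for the tool's configuration file. Because lighttpd then receives SIGTERM rather than SIGKILL, it can take its php-cgi processes down with it regardless of who or what ended the job.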

*** This bug has been marked as a duplicate of bug 61102 ***
