Last modified: 2014-08-25 17:16:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T71934, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 69934 - Web services continually restarting
Status: UNCONFIRMED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Marc A. Pelletier
Depends on:
Blocks:
Reported: 2014-08-23 07:04 UTC by bgwhite
Modified: 2014-08-25 17:16 UTC (History)
CC List: 4 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description bgwhite 2014-08-23 07:04:53 UTC
Since August 17, web services have continually gone up and down. Bigbrother does restart them. The few times I was logged into Labs when things went down, I ran 'qstat -f': load averages for the web-grid nodes were at or above 20.
Comment 1 metatron 2014-08-23 09:38:35 UTC
I can confirm this. Multiple restarts for no clear reason. In both examples below, a running service was terminated and restarted automatically.

Bigbrother mails:
2014-08-21 05:31:16 info: Restarting job 'lighttpd-xtools'
2014-08-21 05:33:05 warn: job 'lighttpd-xtools' failed to start
2014-08-21 05:33:05 info: Restarting job 'lighttpd-xtools'

2014-08-21 20:20:31 info: Restarting job 'lighttpd-xtools'
2014-08-21 20:22:30 warn: job 'lighttpd-xtools' failed to start
2014-08-21 20:22:31 info: Restarting job 'lighttpd-xtools'

qacct reports:
jobname      lighttpd-xtools
jobnumber    3328857
taskid       undefined
account      sge
priority     0
qsub_time    Thu Aug 21 05:33:06 2014
start_time   Thu Aug 21 05:33:18 2014
end_time     Thu Aug 21 20:20:27 2014
granted_pe   NONE
slots        1
failed       0
exit_status  0

jobname      lighttpd-xtools
jobnumber    3345286
taskid       undefined
account      sge
priority     0
qsub_time    Thu Aug 21 20:22:32 2014
start_time   Thu Aug 21 20:22:33 2014
end_time     Fri Aug 22 11:37:33 2014
granted_pe   NONE
slots        1
failed       0
exit_status  0
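
To see how long each instance actually ran before it was terminated, the start_time and end_time fields of the two records above can be subtracted. The following is only a minimal sketch: it works on the text as quoted in this comment (records separated by a blank line, fields trimmed to the ones needed), whereas raw qacct output may be laid out differently. The names QACCT_TEXT, parse_records and runtime are made up for this illustration.

```python
from datetime import datetime

# The two records quoted above, reduced to the fields needed here.
# Assumption: records are separated by a blank line, as in the comment.
QACCT_TEXT = """\
jobname      lighttpd-xtools
jobnumber    3328857
start_time   Thu Aug 21 05:33:18 2014
end_time     Thu Aug 21 20:20:27 2014

jobname      lighttpd-xtools
jobnumber    3345286
start_time   Thu Aug 21 20:22:33 2014
end_time     Fri Aug 22 11:37:33 2014
"""

# Timestamp layout used in the quoted records, e.g. "Thu Aug 21 05:33:18 2014".
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_records(text):
    """Split the text into one dict per record (field name -> value)."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            key, _, value = line.partition(" ")
            record[key] = value.strip()
        records.append(record)
    return records


def runtime(record):
    """Wall-clock time between start_time and end_time of one record."""
    start = datetime.strptime(record["start_time"], TIME_FORMAT)
    end = datetime.strptime(record["end_time"], TIME_FORMAT)
    return end - start


RUNTIMES = {r["jobnumber"]: runtime(r) for r in parse_records(QACCT_TEXT)}

for jobnumber, delta in RUNTIMES.items():
    print(jobnumber, delta)
```

For the two jobs above this gives roughly 14 h 47 min and 15 h 15 min of runtime, each ending with failed 0 and exit_status 0.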
Comment 2 metatron 2014-08-23 09:42:10 UTC
*examples
Comment 3 Marc A. Pelletier 2014-08-25 17:16:46 UTC
A quick perusal of the logs shows that this happens only to a short list (~12) of webservices, in bursts.

My current working hypothesis is that this is due to leaking fcgi combined with memory pressure (that is, the problem is always present but leads to webservices being restarted only when resource use is especially high).

Could the maintainers of the affected tools please look into their logs to see whether those restarts match periods of unusual activity?
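
One possible way for a maintainer to do that comparison is to pull the restart timestamps out of the Bigbrother mails and bucket them by hour, so they can be set against request counts from the tool's own access log. This is only a rough sketch: the line format is taken from the mails quoted in comment 1, the names BIGBROTHER_LOG, restart_times and restarts_per_hour are hypothetical, and reading the real log or access-log files is left out.

```python
import re
from collections import Counter
from datetime import datetime

# Sample input: the Bigbrother lines quoted in comment 1.
BIGBROTHER_LOG = """\
2014-08-21 05:31:16 info: Restarting job 'lighttpd-xtools'
2014-08-21 05:33:05 warn: job 'lighttpd-xtools' failed to start
2014-08-21 05:33:05 info: Restarting job 'lighttpd-xtools'
2014-08-21 20:20:31 info: Restarting job 'lighttpd-xtools'
2014-08-21 20:22:30 warn: job 'lighttpd-xtools' failed to start
2014-08-21 20:22:31 info: Restarting job 'lighttpd-xtools'
"""

# Matches only the "Restarting job" lines; the format is assumed from the mails.
RESTART = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) info: Restarting job '([^']+)'$"
)


def restart_times(log_text):
    """Return (timestamp, job name) for every restart line in the log text."""
    times = []
    for line in log_text.splitlines():
        match = RESTART.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            times.append((when, match.group(2)))
    return times


def restarts_per_hour(log_text):
    """Count restart lines per clock hour, keyed like '2014-08-21 05:00'."""
    return Counter(
        when.strftime("%Y-%m-%d %H:00") for when, _job in restart_times(log_text)
    )


PER_HOUR = restarts_per_hour(BIGBROTHER_LOG)

for hour, count in sorted(PER_HOUR.items()):
    print(hour, count)
```

The hourly buckets can then be lined up with whatever activity measure the tool has (for example, requests per hour) to check whether restarts cluster around busy periods.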
