Last modified: 2013-06-17 22:43:38 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T51599, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 49599 - Dying workers are not always restarted
Status: RESOLVED FIXED
Product: Parsoid
Classification: Unclassified
Component: Web API
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: C. Scott Ananian
Depends on:
Blocks:
Reported: 2013-06-14 22:23 UTC by Gabriel Wicke
Modified: 2013-06-17 22:43 UTC (History)

Description Gabriel Wicke 2013-06-14 22:23:18 UTC
In production it seems that dying workers (due to exceptions) are not always restarted. In some cases there are no 'restarting' messages at all in nohup.out despite most workers having disappeared.

Production is running node 0.8.2 and latest node_modules as of today.
Comment 1 Gabriel Wicke 2013-06-14 22:27:58 UTC
Command to get an overview of the number of node processes on each host in the parsoid group:
dsh -g parsoid 'echo -n "`hostname` "; ps aux | grep node | wc -l'
Comment 2 C. Scott Ananian 2013-06-17 19:46:10 UTC
I'm going to tackle this one today, first by trying to determine if unix signals, OOM, or stack crashers can reproduce this problem.  gwicke indicates that simple exceptions aren't enough to reproduce.
Comment 3 Gabriel Wicke 2013-06-17 19:58:45 UTC
We are currently registering for a 'death' event, but that event is no longer available in the cluster module of node 0.8.17 (http://nodejs.org/dist/v0.8.17/docs/api/cluster.html), nor in 0.10. So it seems that we need to register for 'disconnect' and/or 'exit' instead.
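
As a rough illustration of what registering for 'exit' could look like, here is a minimal sketch. It is not the code from the Parsoid patch: `installRespawn` and its `log` parameter are hypothetical names made up for this example. It relies only on the documented `cluster` 'exit' event, which passes `(worker, code, signal)`, and on `cluster.fork()`. Workers that were shut down on purpose are skipped by checking `worker.suicide` (node 0.8/0.10) or `worker.exitedAfterDisconnect` (the later name for the same flag).

```javascript
// Hypothetical helper (not from the actual patch): fork a replacement
// whenever a worker exits unexpectedly.
// `clusterLike` is expected to be require('cluster') in a master process;
// `log` is any function that accepts a message string.
function installRespawn(clusterLike, log) {
  clusterLike.on('exit', function (worker, code, signal) {
    // Deliberate shutdowns set `suicide` (node 0.8/0.10) or
    // `exitedAfterDisconnect` (later releases); do not respawn those.
    if (worker.suicide || worker.exitedAfterDisconnect) {
      return;
    }
    // `signal` is set when the worker was killed by a signal,
    // otherwise `code` carries the exit code.
    var reason = signal ? 'signal ' + signal : 'exit code ' + code;
    log('worker ' + worker.process.pid + ' died (' + reason + '), restarting');
    clusterLike.fork();
  });
}
```

In a master process this would be wired up once, before forking the initial workers, for example `installRespawn(require('cluster'), console.log)`. Passing the cluster object in as a parameter is only a convenience for this sketch, so that the handler can be exercised with a stand-in event emitter.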
Comment 4 Gerrit Notification Bot 2013-06-17 20:11:24 UTC
Related URL: https://gerrit.wikimedia.org/r/69151 (Gerrit Change I2b7119c928ed27e26181c67c6d300f526cd53801)
Comment 5 Gabriel Wicke 2013-06-17 22:23:46 UTC
Just deployed this patch to production. Will monitor the number of Parsoid workers and close this bug if that number remains constant.
Comment 6 Gabriel Wicke 2013-06-17 22:43:38 UTC
Things look good so far, so closing as fixed.
