Last modified: 2014-11-15 01:21:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T75395, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 73395 - Parsoid should use SO_REUSEADDR when it binds to its port
Status: NEW
Product: Parsoid
Classification: Unclassified
Component: Web API
Version: unspecified
Hardware: All
OS: All
Priority: Unprioritized
Severity: normal
Target Milestone: ---
Assigned To: Parsoid Team
Depends on:
Blocks:
Reported: 2014-11-13 22:16 UTC by C. Scott Ananian
Modified: 2014-11-15 01:21 UTC
CC: 5 users
See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

Description C. Scott Ananian 2014-11-13 22:16:54 UTC
Today during the deploy, several of the parsoid hosts remained down for quite a long time.  Their logs were filled with:

{"name":"parsoid","hostname":"wtp1001","pid":14985,"level":40,"logType":"warning","wiki":"enwiktionary
","title":"綉","oldId":26153377,"msg":"Cache MISS: http://10.2.2.29/enwiktionary/%E7%B6%89?oldid=26153
377","longMsg":"Cache MISS:\nhttp://10.2.2.29/enwiktionary/%E7%B6%89?oldid=26153377","time":"2014-11-1
3T21:15:42.711Z","v":0}
{"name":"parsoid","hostname":"wtp1001","pid":14984,"level":40,"logType":"warning","wiki":"enwiktionary
","title":"綊","oldId":17538871,"msg":"Cache MISS: http://10.2.2.29/enwiktionary/%E7%B6%8A?oldid=17538
871","longMsg":"Cache MISS:\nhttp://10.2.2.29/enwiktionary/%E7%B6%8A?oldid=17538871","time":"2014-11-1
3T21:15:42.713Z","v":0}
[warning][worker][14995] shutting down
[warning][worker][16672] shutting down
[warning][worker][14987] shutting down
[info][master][14981] shutting down, killing workers
[warning][worker][14996] shutting down
[warning][worker][15004] shutting down
[warning][worker][14990] shutting down
[warning][worker][14998] shutting down
[warning][worker][14988] shutting down
[warning][worker][14992] shutting down
[warning][worker][14985] shutting down
[warning][worker][15011] shutting down
[warning][worker][14984] shutting down
[warning][worker][15001] shutting down
[warning][worker][15007] shutting down
{"name":"parsoid","hostname":"wtp1001","pid":15003,"level":30,"logType":"info","wiki":"rowiki","title"
:"2012","oldId":8889091,"msg":"completed parsing in 7469 ms","longMsg":"completed parsing in\n7469\nms
","time":"2014-11-13T21:15:43.087Z","v":0}
[warning][worker][15003] shutting down
[info][master][14981] exiting
[info][master][16784] initializing 15 workers
[info][worker][16793] loading ...
[info][worker][16790] loading ...
[info][worker][16788] loading ...
[info][worker][16798] loading ...
[info][worker][16795] loading ...
[info][worker][16801] loading ...
[info][worker][16806] loading ...
[info][worker][16793] ready on :8000
[error][worker][16793] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.5:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16790] ready on :8000
[info][worker][16798] ready on :8000
[info][worker][16795] ready on :8000
[error][worker][16790] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.3:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[error][worker][16798] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.7:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16801] ready on :8000
[error][worker][16795] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.6:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[error][worker][16801] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.9:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16806] ready on :8000
[info][worker][16791] loading ...
[error][worker][16806] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.11:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16787] loading ...
[info][worker][16788] ready on :8000
[error][worker][16788] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.2:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16799] loading ...
[info][worker][16804] loading ...
[info][worker][16810] loading ...
[info][worker][16808] loading ...
[info][worker][16812] loading ...
[info][worker][16814] loading ...
[info][worker][16791] ready on :8000
[error][worker][16791] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.4:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[info][worker][16787] ready on :8000
[error][worker][16787] Port 8000 is already in use. Exiting.
Stack:
  HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
  HTTPServer.EventEmitter.emit (events.js:95:17)
  net.js:1081:16
  Object.1:1 (cluster.js:592:5)
  handleResponse (cluster.js:171:41)
  respond (cluster.js:192:5)
[..etc...]

Two things to note here: first, logs from the master aren't going through bunyan, and thus aren't ending up in logstash.  There is some discussion of this on https://gerrit.wikimedia.org/r/171618.
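For illustration only, here is a minimal sketch (not the change under review in that gerrit patch; it just assumes the standard bunyan API and that bunyan is installed) of what routing the master's log lines through bunyan might look like, with field names mirroring the worker log entries quoted above:

    // Hypothetical sketch: send the cluster master's messages through a
    // bunyan logger so they end up in logstash like the worker logs above.
    var bunyan = require('bunyan');

    var logger = bunyan.createLogger({ name: 'parsoid' });

    // Instead of plain console output like "[info][master][14981] exiting":
    logger.info({ logType: 'info', pid: process.pid }, 'initializing 15 workers');
    logger.warn({ logType: 'warning', pid: process.pid }, 'shutting down, killing workers');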

Secondly, we need to use SO_REUSEADDR when we open port 8000 so that we don't have to wait for the full linger time to bring up the service after a deploy.
See http://stackoverflow.com/questions/14388706/socket-options-so-reuseaddr-and-so-reuseport-how-do-they-differ-do-they-mean-t
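For comparison, a minimal sketch of the usual Node-side workaround when a lingering old process still holds the port: retry on EADDRINUSE with a short backoff instead of exiting immediately as in the log above. This is illustrative only, not Parsoid's actual code; startServer, RETRY_DELAY_MS and MAX_RETRIES are made up for the example.

    var http = require('http');

    var RETRY_DELAY_MS = 1000;   // made-up value for illustration
    var MAX_RETRIES = 30;

    function startServer(port, attempt) {
        var server = http.createServer(function(req, res) {
            res.end('ok\n');
        });
        server.on('error', function(err) {
            if (err.code === 'EADDRINUSE' && attempt < MAX_RETRIES) {
                console.error('Port ' + port + ' is already in use, retrying...');
                setTimeout(function() { startServer(port, attempt + 1); }, RETRY_DELAY_MS);
            } else {
                console.error(err.stack);
                process.exit(1);
            }
        });
        server.listen(port, function() {
            console.log('ready on :' + port);
        });
    }

    startServer(8000, 0);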
Comment 1 C. Scott Ananian 2014-11-13 22:19:27 UTC
Filed bug 73396 for the "logs don't make it to logstash" issue.
Comment 2 C. Scott Ananian 2014-11-13 22:42:44 UTC
Node's docs claim that they already use SO_REUSEADDR:

http://nodejs.org/api/net.html#net_server_listen_port_host_backlog_callback

This makes our EADDRINUSE errors very mysterious.

However, the 'cluster' package does mysterious things: http://nodejs.org/api/cluster.html#cluster_how_it_works

It's possible that's what's going wrong here, somehow?
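For reference, a minimal standalone cluster sketch (illustrative only, not Parsoid's server.js) of the behaviour described in those docs: each worker calls listen(8000), but the cluster module arranges for the listening socket to be set up via the master and shared among the workers, so the bind is not entirely under an individual worker's control.

    var cluster = require('cluster');
    var http = require('http');

    if (cluster.isMaster) {
        // The master forks the workers; the workers' listen() calls are
        // routed back to the master, which owns the shared handle.
        for (var i = 0; i < 4; i++) {
            cluster.fork();
        }
        cluster.on('exit', function(worker) {
            console.log('[master] worker ' + worker.process.pid + ' exited');
        });
    } else {
        http.createServer(function(req, res) {
            res.end('handled by ' + process.pid + '\n');
        }).listen(8000, function() {
            console.log('[worker][' + process.pid + '] ready on :8000');
        });
    }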
Comment 3 ssastry 2014-11-14 02:45:58 UTC
Given that we have had this code for a long time, that we have had a couple of deploys since the node 0.10 upgrade, and that this is the first deploy after we enabled timeouts, I would say this is related to the timeouts (5 mins).
Comment 4 C. Scott Ananian 2014-11-14 05:14:31 UTC
Possibly we're not shutting down cleanly when the service is stopped, and the zombie process continues holding on to the port until the timeout expires.
Comment 5 ssastry 2014-11-14 05:20:28 UTC
So, this is the code that we use to cleanly terminate the workers (see api/server.js)

                cluster.disconnect(function() {
                        logger.log( "info", "exiting" );
                        process.exit(0);
                });

This waits for the workers to shut down. However, in the common case, the title should get processed fairly quickly and clear the timeout => all the workers should shut down in a timely fashion. I don't think (or at least, I would be surprised if it were the case) that all of those were hung processes that do indeed take 5 mins and eventually get killed by the timeout.

So, one question I now have is whether we are holding onto timeouts after request processing and not clearing them out cleanly in all cases.
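To make that concern concrete, here is a hypothetical sketch (not the actual Parsoid request handling) of how an uncleared per-request timeout would keep a worker's event loop alive after its server has closed, so the process lingers for up to the full timeout before finally exiting:

    var http = require('http');

    var REQUEST_TIMEOUT_MS = 5 * 60 * 1000;  // matches the 5 min timeout mentioned above

    var server = http.createServer(function(req, res) {
        var timer = setTimeout(function() {
            res.writeHead(504);
            res.end('timed out\n');
        }, REQUEST_TIMEOUT_MS);

        // Stand-in for the real parse work:
        setImmediate(function() {
            // If this clearTimeout is missed on some code path, the pending
            // timer keeps the worker process alive after the server closes.
            clearTimeout(timer);
            res.end('done\n');
        });
    });

    server.listen(8000);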
Comment 6 C. Scott Ananian 2014-11-14 21:25:19 UTC
Well, it was 3-4 of the parsoid hosts which had this problem.  So it's possible that these were hung processes which needed to wait for the timeout.

But in this case I would expect 'service parsoid stop' not to actually complete until the timeout had finished and the process had actually stopped.  The problem was that the service was restarted while the old parsoid was still running (or so it would seem).
Comment 7 ssastry 2014-11-15 01:21:46 UTC
Ah, if it was 3-4 hosts, then maybe that is what it is. Ping gwicke about the service restart part; he might have a clue what is going on.
