Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T55776, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 53776 - The round-trip server seems to leak memory in production
Status: RESOLVED FIXED
Product: Parsoid
Classification: Unclassified
Component: tests
Version: unspecified
Hardware: All
OS: All
Importance: High normal
Target Milestone: ---
Assigned To: Marc Ordinas i Llopis
Depends on:
Blocks:
Reported: 2013-09-04 21:39 UTC by Gabriel Wicke
Modified: 2013-10-16 17:07 UTC (History)
CC List: 2 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Gabriel Wicke 2013-09-04 21:39:47 UTC
I just restarted the rt server: it was using 2.1 GB of RAM and 100% CPU on one core, and was not returning requests to the web frontend. MySQL was idle.
Comment 1 Marc Ordinas i Llopis 2013-09-05 17:16:16 UTC
After some rounds of heap profiling, the major increases in size are in compiled code and in arrays of code deoptimization_data.

Will keep on profiling.
Comment 2 Gabriel Wicke 2013-10-04 00:01:32 UTC
Just happened again. Had to kill -9 it; a normal restart did not help. top showed 1.4 GB RSS.
Comment 3 ssastry 2013-10-06 17:33:12 UTC
Gabriel: are you sure this is the rt server? Here is the latest top output, from which I extracted the nodejs/node info. This looks like the Parsoid web service clogging up. The RT server's command is 'node' and the Parsoid web service's is 'nodejs' below.

-----------------------------------------------------------------------------
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  GROUP    COMMAND
------------------------------------------------------------------------------
10527 nobody    20   0 3555m 2.6g 3208 R   83 16.3 999:13.13 nogroup  nodejs
 5192 nobody    20   0 2554m 1.6g 3288 S    0 10.3 667:38.03 nogroup  nodejs
29942 nobody    20   0 2238m 1.3g 2408 R   60  8.6 830:00.08 nogroup  nodejs
13788 nobody    20   0 2149m 1.3g 3176 R   74  8.0 985:14.12 nogroup  nodejs
29857 nobody    20   0 1988m 1.1g 1880 R   76  7.1   2700:26 nogroup  nodejs
 3800 nobody    20   0 1945m 1.1g 2448 R   72  6.8 741:21.81 nogroup  nodejs
 5973 nobody    20   0 1905m 1.0g 1984 R   92  6.5   1201:01 nogroup  nodejs
  348 nobody    20   0 1649m 795m 3324 R   87  5.0 309:30.79 nogroup  nodejs
30270 nobody    20   0 1580m 714m 3280 R   78  4.5 350:47.62 nogroup  nodejs
12581 nobody    20   0 1454m 603m 5408 R   86  3.8  86:18.80 nogroup  nodejs
11284 nobody    20   0 1372m 522m 5400 R   84  3.3 138:46.81 nogroup  nodejs
------------------------------------------------------------------------------
19373 mysql     20   0 1313m 297m 2292 S    6  1.9   5620:09 mysql    mysqld
------------------------------------------------------------------------------
22494 nobody    20   0  858m 190m 2540 S    0  1.2 104:52.24 nogroup  node
10390 nobody    20   0  662m  28m 1848 S    0  0.2   0:03.79 nogroup  node
------------------------------------------------------------------------------

I tried restarting and the following processes didn't restart (I'll kill -9 them). So, I suspect it is not the RT server.

29942 nobody    20   0 2224m 1.3g 2408 R  102  8.5 834:14.19 nodejs
13788 nobody    20   0 2166m 1.3g 3176 R   99  8.1 989:05.14 nodejs
29857 nobody    20   0 2022m 1.1g 1880 R  101  7.3   2704:39 nodejs
 3800 nobody    20   0 1955m 1.1g 2448 R  100  6.8 745:32.87 nodejs
 5973 nobody    20   0 1915m 1.0g 1984 R  101  6.6   1205:07 nodejs
11284 nobody    20   0 1383m 535m 5400 R  102  3.3 142:59.75 nodejs
Comment 4 ssastry 2013-10-07 01:02:58 UTC
I see it now as well ... 1.5 GB so far. So, there are probably different sources -- some for the workers, some for the RT server. Hmm ...

I got confused because, when I looked at top at the time, the test run had already completed. So I suppose it is either the title fetch or the save-results endpoint that is leaking memory (not necessarily our code; it could also be something in the libraries).
Comment 5 Marc Ordinas i Llopis 2013-10-08 11:24:03 UTC
I did detect some memory leakage while profiling locally; if I recall correctly, ~90k per fetch/results cycle. As I mentioned in a previous comment, the leakage seemed to come from unreleased arrays of "code deoptimization_data". If I remember V8's architecture correctly, that could come from virtual method calls over many different implementations, from debugging/profiling artifacts, or from a few other things. And of course it could be our code or some library leaking closures, but I think those would show up differently, especially if the code is only used once.
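
For illustration only, here is the general shape of a closure leak in Node.js; none of these names come from the Parsoid or library code, they are purely hypothetical:

// Purely illustrative: a long-lived, module-level registry that keeps every
// callback it is handed will retain those closures (and everything they
// capture: request objects, result rows, etc.) for the lifetime of the process.
var pendingCallbacks = [];

function doWork(db, cb) {
    // This closure captures `db` and `cb`. Because it is pushed onto a
    // module-level array and never removed, neither the closure nor anything
    // it references can ever be garbage-collected.
    pendingCallbacks.push(function onDone(err, result) {
        cb(err, result);
    });
}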

That's where profiling took me… I'm a little stumped about what to try next, but I think I'll make a small test case to check that it's not the mysql library.
Comment 6 Gabriel Wicke 2013-10-08 22:09:09 UTC
I took a heap snapshot in production and inspected it in Chromium. It seems that a large number of closures are responsible for the memory leak. From browsing the data a bit, it seems likely that the claimPage closure is responsible for much of the leaking, which is not surprising as it is called so often. I also see a large number of Error instances in the heap, which might indicate that an error code path (transaction rollback?) is responsible for the leak.
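
For reference, one common way to capture such a snapshot from a running Node process is the third-party heapdump module; this is only a sketch of that approach, not necessarily how the snapshot above was taken:

// Sketch: write a V8 heap snapshot to disk using the `heapdump` module, then
// load the resulting .heapsnapshot file in Chromium's DevTools memory profiler.
var heapdump = require('heapdump');

// Write a snapshot on demand; heapdump can also write one when the process
// receives SIGUSR2, which is handy for a production server.
heapdump.writeSnapshot('/tmp/rtserver-' + Date.now() + '.heapsnapshot');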

Our use of the mysql bindings seems to differ a bit from the documentation at https://github.com/felixge/node-mysql#transactions. The differences I see are the execute() call, the end, and calling rollback on the return value of startTransaction rather than on db. There are some other structures that look a bit cyclical, like the transUpdateCB binding. Maybe some of this can be converted to use async (https://github.com/felixge/node-mysql#transactions)?
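
For comparison, the transaction flow in the linked node-mysql documentation looks roughly like the sketch below; the table and column names are hypothetical, and begin/commit/rollback are all called on the connection itself:

var mysql = require('mysql');
var connection = mysql.createConnection({ host: 'localhost', user: 'parsoid', database: 'rt' });

connection.beginTransaction(function (err) {
    if (err) { throw err; }
    connection.query('UPDATE pages SET claimed = 1 WHERE id = ?', [42], function (err) {
        if (err) {
            // On failure, rollback is issued on the connection, not on a
            // separate transaction object.
            return connection.rollback(function () { throw err; });
        }
        connection.commit(function (err) {
            if (err) {
                return connection.rollback(function () { throw err; });
            }
        });
    });
});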
Comment 7 Marc Ordinas i Llopis 2013-10-10 09:43:53 UTC
Just a note on why the code looks different from https://github.com/felixge/node-mysql#transactions: the transaction support in node-mysql is very recent and (afaik) hasn't been released to npm yet. We're using node-mysql-queues for transactions, from https://github.com/bminer/node-mysql-queues .
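
A rough sketch of that node-mysql-queues style, with hypothetical query and connection details, which also shows why rollback/commit are called on the object returned by startTransaction rather than on the connection:

var mysql = require('mysql');
var client = mysql.createConnection({ host: 'localhost', user: 'parsoid', database: 'rt' });
// mysql-queues wraps the client and adds createQueue()/startTransaction().
require('mysql-queues')(client);

var trans = client.startTransaction();
trans.query('UPDATE pages SET claimed = 1 WHERE id = ?', [42], function (err) {
    if (err) {
        trans.rollback();
    } else {
        trans.commit();
    }
});
// Queued queries only run once execute() is called on the transaction.
trans.execute();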
Comment 8 Marc Ordinas i Llopis 2013-10-10 17:47:32 UTC
After following the retained memory in gwicke's snapshot, it seems the biggest culprit is the _callSite member of Query. And lo and behold, I then found this commit to node-mysql:
https://github.com/felixge/node-mysql/commit/fd984a71c715a9ffe54b9154d406a85e3711940f

Can we check that the server is running the latest node-mysql version, 2.0.0-alpha9?
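
One quick way to check which node-mysql version the rt server process would actually load (run from the server's working directory):

// Prints the version of the `mysql` module resolved from the current directory,
// e.g. "2.0.0-alpha9".
// Equivalent one-liner: node -e "console.log(require('mysql/package.json').version)"
// `npm ls mysql` in the checkout shows the same information.
console.log(require('mysql/package.json').version);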
Comment 9 Gabriel Wicke 2013-10-10 19:34:50 UTC
It was at 2.0.0-alpha8. It looks like upgrading to 2.0.0-alpha9 has helped.
Comment 10 Gabriel Wicke 2013-10-10 21:25:43 UTC
Looks better, but:
[14:24] <gwicke> mysql on the rt server does not look too happy
[14:24] <gwicke> the node portion of it is now at 380M
[14:24] <gwicke> probably a lot of waiting / conflicting transactions

Let's see how this develops.
Comment 11 Gabriel Wicke 2013-10-16 17:07:36 UTC
node-mysql was not updated properly. Mea culpa.
