Last modified: 2014-01-08 22:29:02 UTC
On RT-testing clients, a few pages seem to cause long post-result pauses (could be the serializer, could be the diff classifier), which could point to a memory leak, a library issue, or something else:

plwiki:Podsumowanie_startów_zespołu_Lotus_w_Formule_1 (88064 ms)
itwiki:Campionato_Dilettanti_1958-1959 (parses in 11558 ms)
itwiki:Città_della_Pennsylvania_(M-P) (parses in 35763 ms)

This could be node 0.10 specific, but I haven't tested on node 0.8 yet. I can verify that at least the plwiki page above takes a very long time to parse. I'm adding this info now, before I run out, so it can be investigated later.
Another: enwiki:Russia_at_the_2004_Summer_Olympics in 35387 ms
Change 103639 had a related patch set uploaded by GWicke: Bug 58952: Avoid repeated require of Util from defines https://gerrit.wikimedia.org/r/103639
Change 103756 had a related patch set uploaded by GWicke: Bug 58952 WIP: Try to make sure env and old documents are freed early https://gerrit.wikimedia.org/r/103756
Actually, on some of these pages it looks like, at least in RT testing, diff classification is the culprit. Same behavior on node 0.8.26, 0.10.19, and 0.10.24.

----------
[subbu@earth tests] time node roundtrip-test.js --prefix itwiki Campionato_Dilettanti_1958-1959 > /dev/null
completed parsing of itwiki:Campionato_Dilettanti_1958-1959 in 11524 ms
wt2html + html2wt time: 14674
130.808u 1.072s 2:28.11 89.0% 0+0k 0+0io 0pf+0w
----------
[subbu@earth tests] time node roundtrip-test.js --prefix plwiki 'Podsumowanie_startów_zespołu_Lotus_w_Formule_1' > /dev/null
completed parsing of plwiki:Podsumowanie_startów_zespołu_Lotus_w_Formule_1 in 57277 ms
wt2html + html2wt time: 64485
464.153u 3.188s 9:25.44 82.6% 0+0k 0+0io 0pf+0w
----------

node 0.10 seems faster than 0.8 on the plwiki page. In addition, roundtrip-test.js on node 0.8 failed with "FATAL ERROR: JS Allocation failed - process out of memory". So, it seems the RT issues on these pages are present in both 0.8 and 0.10.
It looks like jsdiff is the culprit, at least in rt-testing, for inefficient processing of some pages -- it is the main cause of both the long processing times and the memory usage (peak 2.3g virtual, 1.3g resident) on the plwiki test page.

----------------------
[subbu@earth tests] time node roundtrip-test.js --prefix itwiki Campionato_Dilettanti_1958-1959 > /dev/null
completed parsing of itwiki:Campionato_Dilettanti_1958-1959 in 45254 ms
after parse: Thu Dec 26 2013 15:48:46 GMT-0600 (CST)
wt2html + html2wt time: 49303
after serialization before rt-diff: Thu Dec 26 2013 15:48:49 GMT-0600 (CST)
after jsdiff: Thu Dec 26 2013 15:51:12 GMT-0600 (CST)
after convert-diff: Thu Dec 26 2013 15:51:12 GMT-0600 (CST)
before double-rt-diff: Thu Dec 26 2013 15:51:13 GMT-0600 (CST)
before checkIfSignificant: Thu Dec 26 2013 15:51:55 GMT-0600 (CST)
159.505u 1.272s 3:58.13 67.5% 0+0k 0+0io 0pf+0w
----------------------

Relevant entry for plwiki:Podsumowanie_startów_zespołu_Lotus_w_Formule_1:

----------------------
after serialization before rt-diff: Thu Dec 26 2013 15:56:42 GMT-0600 (CST)
after jsdiff: Thu Dec 26 2013 16:03:40 GMT-0600 (CST)
----------------------
The general GC / possible memory leak issue happens without jsdiff too. The backend server forked by roundtrip-test.js grows in rss when processing many requests in a row, until 1.7G is reached. Only pure wt2html and html2wt modes are used there, and only html2wt POSTs seem to trigger the memory growth.

Some of my patches above are geared towards making it more likely that the small incremental collections in node 0.10 still pick up garbage as early as possible, by explicitly breaking reference cycles. The v8 GC in node 0.10 is apparently tuned more for low pause times than for thorough collection. I experimented with a limited --max_old_space_size and a few other GC-related parameters to force full old-space collections, and was able to keep rt tests within 300m rss for a while. GC used most of the time when it got close to the limit though, so more experimentation with a slightly higher limit is needed. This issue is also discussed at https://github.com/joyent/node/issues/5828. The GC in v8 3.20 (which will be in node 0.12) might have relevant fixes.
Hmm ... okay ... so, I guess it looks like we have to do a direct upgrade to 0.12 instead, unless we get a reliable workaround with the --max_old_space_size parameter.
Also possibly relevant: The "LATENCY AND IDLE GARBAGE COLLECTION" paragraph in http://blog.nodejs.org/2013/03/11/node-v0-10-0-stable/
Change 105875 had a related patch set uploaded by GWicke: WIP Bug 58952: Help incremental GC further by breaking refs https://gerrit.wikimedia.org/r/105875
Change 105875 merged by jenkins-bot: Bug 58952: Help incremental GC further by breaking refs https://gerrit.wikimedia.org/r/105875
The 0.10-specific memory leak issue seems to be solved by these patches. After the last round-trip test run, the largest node process used 600m, with most using around 300m, which is very close to what we have seen in 0.8. The jsdiff issue on large pages is however still present and independent of the node version; I have created bug 59840 to track just the jsdiff issue. I'm closing this bug as fixed, as it was primarily about node 0.10 issues.
A dsh command to list the memory use of all node processes in rt testing (from https://www.mediawiki.org/wiki/Parsoid/Round-trip_testing#Some_dsh_tricks):

dsh -M -cf /home/gwicke/rtclients ps -C nodejs -o rss= | sort -n -k2