Last modified: 2013-10-10 21:34:48 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T55241, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 53241 - Use Map and Set instead of plain objects / ES6 shim cleanup
Use Map and Set instead of plain objects / ES6 shim cleanup
Status: RESOLVED FIXED
Product: Parsoid
Classification: Unclassified
General (Other open bugs)
unspecified
All All
: Normal normal
: ---
Assigned To: Arlo Breault
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-08-22 21:23 UTC by Gabriel Wicke
Modified: 2013-10-10 21:34 UTC (History)
4 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Gabriel Wicke 2013-08-22 21:23:58 UTC
We currently use plain objects in many places where users can pass in keys. Even with Object.create(null), users can pass in the key '__proto__' which will happily set the prototype, for example to an object if that is what we set as a value.

ES6 / harmony Map and Set (see http://wiki.ecmascript.org/doku.php?id=harmony:simple_maps_and_sets) fix this problem (and allow objects as keys), so we should use them. Native Map / Set is not yet complete in node 0.10, so we should probably use one of the existing shims to get it:

https://github.com/paulmillr/es6-shim
https://github.com/Benvie/harmony-collections

The es6 shim also gives us more goodies which might be useful to have for cleaner code. We should not go overboard though, and only use features where a native implementation is likely to happen soon.
Comment 1 Gabriel Wicke 2013-08-22 21:25:48 UTC
Another contender: https://github.com/WebReflection/es6-collections
Comment 2 Gabriel Wicke 2013-08-22 23:14:55 UTC
(In reply to comment #1)
> Another contender: https://github.com/WebReflection/es6-collections

This one uses linear search on a key array to support object keys. Search for 'function betterIndexOf' in https://github.com/WebReflection/es6-collections/blob/master/src/es6-collections.js.
Comment 3 Gabriel Wicke 2013-08-22 23:17:04 UTC
https://github.com/paulmillr/es6-shim/blob/master/es6-shim.js

uses iteration on a linked list in its Map implementation. Search for 'function Map'.
Comment 4 Gabriel Wicke 2013-08-22 23:33:52 UTC
https://github.com/Benvie/harmony-collections/blob/master/harmony-collections.js

seems to be the only one of the contenders so far that uses native objects for primitive keys (and linear search for object keys), so only incurs constant overhead for each string key access. __proto__ is mapped to a randomized string. Memory overhead is only incurred per Map, but not per item.

Seems to be the best option for large maps. For small maps linear search is likely faster.
Comment 5 C. Scott Ananian 2013-09-11 18:12:48 UTC
The standard thunk is something like:

function Map() { this.store = Object.create(null); }
Map.prototype.get = function( k ) { return this.store['$'+k]; }
Map.prototype.set = function( k, v ) { this.store['$'+k] = v; }

...but we should be a little careful not to go overboard here.  In many cases we know that '__proto__' cannot be a possible key.  In other case we are doing queries only (for example, to determine whether a tag name is a member of a specific set), and Object.create(null).__proto__===null, so that's perfectly safe as well.

So, sure we should use safer maps where needed, but it probably doesn't have to be a hugely-invasive patch set.
Comment 6 Gabriel Wicke 2013-09-13 06:20:42 UTC
(In reply to comment #5)
> The standard thunk is something like:
> 
> function Map() { this.store = Object.create(null); }
> Map.prototype.get = function( k ) { return this.store['$'+k]; }
> Map.prototype.set = function( k, v ) { this.store['$'+k] = v; }

This is basically what harmony-collections does for primitive keys when native maps are not available. It also implements the full spec, which is good for consistency even if we don't need objects as keys right now.

> So, sure we should use safer maps where needed, but it probably doesn't have
> to
> be a hugely-invasive patch set.

Yup, only where users pass in arbitrary keys for new items, as I noted in comment #1.
Comment 7 Gerrit Notification Bot 2013-09-21 16:06:33 UTC
Change 85432 had a related patch set uploaded by Arlolra:
WIP: ES6 shim cleanup.

https://gerrit.wikimedia.org/r/85432
Comment 8 Gerrit Notification Bot 2013-10-02 23:33:01 UTC
Change 85432 merged by jenkins-bot:
ES6 shim cleanup

https://gerrit.wikimedia.org/r/85432
Comment 9 Gabriel Wicke 2013-10-07 19:56:58 UTC
harmony-collections are quite slow in production:

----------------------------------------------------------
Latest master:
[subbu@earth tests] time node parserTests --wt2html --wt2wt --selser --html2wt --html2html >&! /dev/null
94.013u 0.728s 1:34.32 100.4%    0+0k 0+0io 0pf+0w
----------------------------------------------------------
[subbu@earth lib] git checkout 064cee74
[subbu@earth tests] time node parserTests --wt2html --wt2wt --selser --html2wt --html2html > & ! /dev/null
59.475u 0.456s 0:59.65 100.4%    0+0k 0+0io 0pf+0w
----------------------------------------------------------

Cscott has a fork that performs better at https://github.com/cscott/harmony-collections.
Comment 10 C. Scott Ananian 2013-10-07 20:11:01 UTC
I've fixed the performance regression by patching harmony-collections.

I might also patch es6-shim to be O(1) so that would be another option.
Comment 11 Gabriel Wicke 2013-10-07 20:32:28 UTC
(In reply to comment #10)
> I've fixed the performance regression by patching harmony-collections.
> 
> I might also patch es6-shim to be O(1) so that would be another option.

Can you post the current timings?
Comment 12 C. Scott Ananian 2013-10-07 21:27:34 UTC
$ node --version
v0.10.20
$ git checkout master
$ time tests/parserTests.js --quiet
real	1m8.476s
user	1m3.036s
sys	0m0.628s

$ git checkout no-es6
$ time tests/parserTests.js --quiet
real	0m37.091s
user	0m37.036s
sys	0m0.444s

With the gross hack in https://gerrit.wikimedia.org/r/88148 :

$ time tests/parserTests.js --quiet
real	0m39.008s
user	0m38.876s
sys	0m0.540s

With https://gerrit.wikimedia.org/r/88194 and https://gerrit.wikimedia.org/r/88193 :

$ time tests/parserTests.js --quiet
real	0m42.924s
user	0m42.668s
sys	0m0.456s
Comment 13 C. Scott Ananian 2013-10-07 22:09:04 UTC
With es6-shim (https://gerrit.wikimedia.org/r/88241 and https://gerrit.wikimedia.org/r/88242 ):

$ time tests/parserTests.js --quiet
real	0m46.265s
user	0m46.228s
sys	0m0.488s

Unlike the harmony-collections patch, this is a fork that can be upstreamed (https://github.com/paulmillr/es6-shim/issues/148).

Don't know why exactly this is slower than the harmony-collections version; I suspect it's due to the other stuff hacked by es6-shim.  The actual Map/Set implementation should be faster.
Comment 14 C. Scott Ananian 2013-10-07 22:57:48 UTC
(In reply to comment #13)
> Don't know why exactly this is slower than the harmony-collections version; I
> suspect it's due to the other stuff hacked by es6-shim.  The actual Map/Set
> implementation should be faster.

I was wrong.  Simply adding es6-shim does not make the parserTests benchmark, in my tests.  The es6-shim implementation seems to be slower solely because its constructor is slower.  We make 88,143 instances of Set via JSUtils.arrayToSet during a run of parserTests, and those constructor calls seem to account for the extra 7s of runtime.
Comment 15 C. Scott Ananian 2013-10-08 16:39:44 UTC
OK.  I suggest moving to es6-shim.  The codebase is much simpler.

Things left to do:

a) profile our code and figure out why we are creating so many different Set objects during a parserTests run.  Possible we can hoist many of them out (into mediawiki.wikitext.constants.js) which would greatly reduce the speed penalty from using es6-shim.

b) (possibly) hack es6-shim to lazy-initialize the Map objects.  Only create the backing linked-list if non-string keys are ever used.  [gwicke suggests that numeric keys could also be sped up, by prefixing string keys with '$' instead of testing against '__proto__']

c) (possibly) add a FastStringSet to JSUtils (along the lines of https://gerrit.wikimedia.org/r/88148 ).

c2) figure out what the hotspots are for Set.has() -- possibly only a few sites actually need to use FastStringSet.

d) subbu has also requested re-doing some of the performance comparisons above using a single (large) article as a benchmark, instead of ParserTests.  This would help determine whether the performance loss was significant in real use (as opposed to just in parserTests).

Arlo: I'm moving on to other tasks right now.  Go ahead and tackle any of the above that you like.
Comment 16 ssastry 2013-10-08 16:46:47 UTC
Regarding (d), here is one possible page: fr:Coupe_du_pays_de_Galles_de_football
See https://gist.github.com/subbuss/6876447 for some tests I did y'day morning.
Comment 17 C. Scott Ananian 2013-10-09 06:13:26 UTC
wrt (b) https://github.com/cscott/es6-shim/tree/faster-map recovers the lost speed:

$ time tests/parserTests.js --quiet
real	0m38.924s
user	0m38.896s
sys	0m0.464s
Comment 18 Arlo Breault 2013-10-10 05:25:43 UTC
For a), this seems to be the largest culprit: https://gerrit.wikimedia.org/r/88927
Comment 19 Arlo Breault 2013-10-10 05:59:45 UTC
Subbu, here's a comparison as requested in d),
https://gist.github.com/arlolra/6913667

So it's clear, in this case:

master is before that merge that just went in, so ec485737eaa10f5336578e5a9f54fe944de9fc23

no-es6 is master with, git revert 560c79e9a1ee8d01bc9cc4b862f280fa6752a610

and es6regress is what just went in dea9cbfa8ff4a6aea12930ca13c2d0f3491d8477

Still seeing a regression so taking one of Cscott's solutions is probably best.
Comment 20 Gerrit Notification Bot 2013-10-10 18:15:29 UTC
Change 88242 had a related patch set uploaded by Cscott:
Replace harmony-collections with es6-shim.

https://gerrit.wikimedia.org/r/88242
Comment 21 Gerrit Notification Bot 2013-10-10 20:02:48 UTC
Change 88242 merged by jenkins-bot:
Replace harmony-collections with es6-shim

https://gerrit.wikimedia.org/r/88242
Comment 22 Arlo Breault 2013-10-10 21:34:48 UTC
Seems good.

starting parsing of fr:Coupe_du_pays_de_Galles_de_football
completed parsing of fr:Coupe_du_pays_de_Galles_de_football in 6671 ms
starting parsing of fr:Coupe_du_pays_de_Galles_de_football
completed parsing of fr:Coupe_du_pays_de_Galles_de_football in 4247 ms
starting parsing of fr:Coupe_du_pays_de_Galles_de_football
completed parsing of fr:Coupe_du_pays_de_Galles_de_football in 4395 ms

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links