Last modified: 2013-02-20 16:32:02 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T46447, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 44447 - Top-level block-wise tokenization for better performance
Status: RESOLVED FIXED
Product: Parsoid
Classification: Unclassified
Component: tokenizer
Version: unspecified
Hardware: All All
Importance: Low normal
Target Milestone: ---
Assigned To: ssastry
Keywords: performance
Depends on:
Blocks:
Reported: 2013-01-29 00:14 UTC by Gabriel Wicke
Modified: 2013-02-20 16:32 UTC (History)
CC List: 3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description Gabriel Wicke 2013-01-29 00:14:06 UTC
The current tokenizer runs as a single pass and does not yield to other async tasks until it is done, so a large number of tokens and async actions are queued up at once. It would be more efficient to yield cooperatively after each top-level block (or thereabouts), so that async processing can start as soon as data becomes available. A simple process.nextTick call after each top-level block, plus a new offset parameter that restarts tokenization at a given offset, is probably all that is needed to achieve this.
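
A minimal sketch of the idea, not the Parsoid implementation: tokenizeBlock is a hypothetical stand-in that treats each blank-line-separated paragraph as one top-level block, and the schedule parameter is an assumption added so the yield mechanism can be swapped. It defaults to setImmediate because in current Node.js releases process.nextTick callbacks run before pending I/O, whereas setImmediate lets I/O callbacks run in between.

```javascript
// Hypothetical stand-in for the real tokenizer: tokenizes exactly one
// top-level block starting at `offset` and reports where the next one begins.
function tokenizeBlock(src, offset) {
  if (offset >= src.length) {
    return null; // nothing left to tokenize
  }
  let end = src.indexOf('\n\n', offset);
  end = end === -1 ? src.length : end + 2;
  return {
    tokens: [{ type: 'block', text: src.slice(offset, end).trim() }],
    nextOffset: end,
  };
}

// Tokenize one block, hand its tokens downstream, then yield before
// restarting tokenization at the offset where the previous block ended.
function tokenizeCooperatively(src, onTokens, onDone, schedule = setImmediate) {
  function step(offset) {
    const result = tokenizeBlock(src, offset);
    if (result === null) {
      onDone();
      return;
    }
    onTokens(result.tokens);
    schedule(() => step(result.nextOffset));
  }
  step(0);
}
```

Calling tokenizeCooperatively(text, handleTokens, finish) would deliver tokens block by block, so downstream async work on the first block can begin while later blocks are still untokenized.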
Comment 1 Gabriel Wicke 2013-02-06 21:25:21 UTC
The tsr values on tokens also need to be updated for subsequent blocks, since the tokenizer's internal offset always starts at zero, whatever position in the source tokenization was restarted from.
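
The adjustment could look roughly like the sketch below. The token shape (a plain object with an optional tsr pair of [start, end] offsets) and the name shiftTsr are simplifying assumptions for illustration, not Parsoid's actual token classes.

```javascript
// Shift zero-based tsr ranges so they refer to positions in the full source.
// `chunkStart` is the offset in the full source at which the tokenizer was
// restarted for this block. Tokens without a tsr are passed through as-is.
function shiftTsr(tokens, chunkStart) {
  return tokens.map((tok) => {
    if (!Array.isArray(tok.tsr)) {
      return tok;
    }
    return {
      ...tok,
      tsr: [tok.tsr[0] + chunkStart, tok.tsr[1] + chunkStart],
    };
  });
}
```

For a block tokenized from offset 100, a token whose tokenizer-local range is [0, 5] would come out with the range [100, 105].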
Comment 2 ssastry 2013-02-20 16:32:02 UTC
Implemented in https://gerrit.wikimedia.org/r/#/c/49856/ and several related patches before this final one. Currently being RT-tested. Looks good so far. Closing. Reopen if any significant concerns surface.
