Last modified: 2013-02-20 16:32:02 UTC
The current tokenizer runs as a single synchronous pass and does not yield to other async tasks until it has finished. As a result, a large number of tokens and async actions are queued at once. It would be more efficient to yield cooperatively after roughly each top-level block, so that async processing can start as soon as data becomes available. A simple process.nextTick call after each top-level block and a new offset parameter that lets the tokenizer restart at a given offset are probably all that is needed to achieve this.
The tsr (token source range) values on tokens also need to be adjusted for subsequent blocks, because the tokenizer's internal offset always starts at zero for each restart.
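The approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: tokenizeTopLevelBlock is a hypothetical stand-in for the real tokenizer that simply splits on blank lines, and the names shiftTsr, tokenizeAsync, onChunk and onEnd are invented for this sketch. It uses setImmediate rather than process.nextTick because in current Node.js versions process.nextTick callbacks run before pending I/O and so do not let other work proceed.

```javascript
// Hypothetical stand-in for the real tokenizer: consumes one "top-level
// block" (text up to and including the next blank line) starting at
// `offset`, and returns tokens whose tsr is zero-based relative to `offset`.
function tokenizeTopLevelBlock(src, offset) {
  const rest = src.slice(offset);
  const end = rest.indexOf('\n\n');
  const len = end === -1 ? rest.length : end + 2;
  const token = { type: 'block', text: rest.slice(0, len), tsr: [0, len] };
  return { tokens: [token], consumed: len };
}

// Shift zero-based tsr values so they refer to positions in the full source.
function shiftTsr(tokens, offset) {
  for (const t of tokens) {
    if (t.tsr) {
      t.tsr = [t.tsr[0] + offset, t.tsr[1] + offset];
    }
  }
  return tokens;
}

// Tokenize one block, hand its tokens to the consumer, then yield to the
// event loop before continuing with the next block at the new offset.
function tokenizeAsync(src, onChunk, onEnd, offset = 0) {
  if (offset >= src.length) {
    onEnd();
    return;
  }
  const { tokens, consumed } = tokenizeTopLevelBlock(src, offset);
  onChunk(shiftTsr(tokens, offset));
  setImmediate(() => tokenizeAsync(src, onChunk, onEnd, offset + consumed));
}
```

Because each block is emitted before the next one is tokenized, downstream async stages can begin work on early blocks while later ones are still waiting, and the tsr adjustment keeps source offsets consistent across restarts.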
Implemented in https://gerrit.wikimedia.org/r/#/c/49856/ and several related patches leading up to it. Currently in round-trip (RT) testing and looking good so far. Closing; reopen if any significant concerns surface.