Last modified: 2011-03-13 18:04:32 UTC
To make things faster, please change any capturing groups to non-capturing ones when compiling the regex from the blacklist page. This also eliminates the need for people to know and remember to use non-capturing groups. I can't think of any use for capturing groups in the spam blacklist at all, but it's possible I'm not imaginative enough. At the least, we're not using that in any blacklist I know of, so it can't be terribly useful.
Each regex is a capturing group that captures the url that triggered the list, which is passed through to and displayed in the error message. Removing these would make it impossible to resolve spam matches, because you wouldn't know which url needed to be removed. I assume you mean automatically make any capturing groups defined *within* a regex line non-capturing? That could certainly be done.
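To make the two levels of grouping concrete, here is a minimal Python sketch (the extension itself is not Python, and the function and pattern names here are made up for illustration): each blacklist line has its bare `(` groups rewritten to `(?:`, and the joined alternation is then wrapped in one outer capturing group so the offending url can be shown in the error message.

```python
import re

def compile_blacklist(lines):
    """Join blacklist lines into one regex, wrapping the whole
    alternation in a capturing group so the matching url can be
    reported back in the error message."""
    # Rewrite bare "(" to "(?:" inside each line, leaving escaped
    # parens and existing "(?..." constructs (?:, ?=, ?!) alone.
    noncap = [re.sub(r'(?<!\\)\((?!\?)', '(?:', l) for l in lines]
    return re.compile('(' + '|'.join(noncap) + ')', re.IGNORECASE)

# Hypothetical blacklist entries:
rx = compile_blacklist([r'spam(ming)?-site\.example', r'bad\.example/\w+'])
m = rx.search('see http://spamming-site.example/page')
print(m.group(1))  # prints spamming-site.example
```

The lookbehind/lookahead rewrite is only a sketch; it does not handle every escaping edge case (e.g. a literal backslash immediately before a group), which is part of why doing this automatically needs care.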
(In reply to comment #1)
> Each regex is a capturing group that captures the url that triggered the
> list, which is passed through to and displayed in the error message.
> Removing these would make it impossible to resolve spam matches, because
> you wouldn't know which url needed to be removed. I assume you mean
> automatically make any capturing groups defined *within* a regex line
> non-capturing? That could certainly be done.

Yes, that's what I meant :D
WONTFIX as "premature optimization is the root of all evil"?
(In reply to comment #3)
> WONTFIX as "premature optimization is the root of all evil"?

Seth & VVV did some testing and determined it was a significant reduction to use non-capturing groups, which is why we use them (well, that and someone was very persistent in annoying people till they did it). It'd be better to have that done by the extension than by humans. Also, then when someone next whinges at us to use non-capturing groups for optimization we can tell them to stuff it because the extension does it internally (I wonder what the next thing will be that we urgently need to do to avoid a few ms 9.9)
Are we talking ms or µs here? I'm guessing µs. Besides, wouldn't capturing groups be necessary for something like (['"]).*?\1 or whatever? They aren't only used for replacements. This reduces functionality for, at *best*, some dubious micro-optimization effort that has no real-world benchmarks attached. Real-world means saying how much user or real time it measurably adds to actual page loads on a real site, not "it takes 97% less time if I make a 50M regex solely out of capturing expressions". Also not "some people did some testing and said it was significant". Still say WONTFIX.
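The backreference point can be demonstrated with a quick Python sketch (again, not the extension's own code): in (['"]).*?\1, the backreference \1 needs the capturing group, so mechanically rewriting it as non-capturing makes the pattern invalid.

```python
import re

# Backreference pattern: a string opened and closed by the same quote.
quoted = re.compile(r"""(['"]).*?\1""")
print(quoted.search("""say 'hi' and "bye" """).group())  # prints 'hi'

# Mechanically rewriting the group as non-capturing breaks it:
# \1 no longer refers to any group, so the pattern won't compile.
try:
    re.compile(r"""(?:['"]).*?\1""")
except re.error as e:
    print('compile failed:', e)
```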
(In reply to comment #5)
> (['"]).*?\1

I've never seen backreferences used in any spam blacklist. FWIW, VVV said he used the Meta blacklist, IIRC. CCed him to verify. But, if there's definitely no benefit, then this should be WONTFIXED and I'll stop bothering to use non-capturing groups manually too, which will be lovely :D
Guys, can you do a quick test and list out the *actual* speeds with the actual blacklist regexes here? This seems like it's an easy thing to resolve empirically. :)
Created attachment 6136 [details]
Test script

I benchmarked it (benchmarking tools attached) and got the following results (on Toolserver):

grouping:    0.0158297939301 s
nongrouping: 0.0151363241673 s
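For anyone who wants to reproduce the comparison without the attached script (the 'spambl' input it needs is generated by run-tests), here is a minimal timeit sketch; the entries and text are invented stand-ins, so absolute numbers will differ from the Toolserver figures above.

```python
import re
import timeit

# Hypothetical stand-ins for blacklist entries and page wikitext.
entries = ['spam%d\\.example' % i for i in range(200)]
cap = re.compile('(' + '|'.join(entries) + ')')       # capturing alternation
noncap = re.compile('(?:' + '|'.join(entries) + ')')  # non-capturing alternation
text = 'some wikitext linking to http://spam199.example/page ' * 50

t_cap = timeit.timeit(lambda: cap.search(text), number=1000)
t_noncap = timeit.timeit(lambda: noncap.search(text), number=1000)
print('capturing:     %.4f s' % t_cap)
print('non-capturing: %.4f s' % t_noncap)
```

How much the capturing overhead matters varies by regex engine and input, which is exactly why real-blacklist numbers like the ones above are more persuasive than synthetic ones.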
So that's less than a 5% difference, absolute difference of 7ms in the test case (151 vs 158ms total). The 'spambl' file seems to be missing from the test case archive, so I can't reproduce results locally. Is functionality unchanged with the switch to nongrouping, or would this require changing other components?
That's less than a 5% difference doing *nothing else*. It's *less than a millisecond*. It would probably sink to more like 0.5% of the entire page save process, a negligible gain. On the other hand, you're removing functionality, whether or not it may be unused at present. Still recommend WONTFIX.

(In reply to comment #9)
> So that's less than a 5% difference, absolute difference of 7ms in the test
> case (151 vs 158ms total).

That's 15.8 vs 15.1 ms total -- a difference of 0.7 ms, or 700 µs.

> The 'spambl' file seems to be missing from the test case archive, so I
> can't reproduce results locally.

It's created by running run-tests, which also runs the benchmark. On localhost I get about the same results (~1 ms difference, 5% gain).

> Is functionality unchanged with the switch to nongrouping, or would this
> require changing other components?

Functionality is reduced -- it's no longer possible to use back-references. It also gives the Principle of Least Surprise a severe smack on the head (who would expect something to take raw regexes and then change their functionality weirdly?).
Whoops, got a digit wrong there. :) Yes, 0.7 ms, not 1 ms. Looks like there's no compelling reason to poke at this for now. Resolving WONTFIX. Reopen later if the difference becomes more compelling and there's a cleaner way to do it without potentially breaking backreference use.