Skip to content Skip to sidebar Skip to footer

Fastest Way To Test For A Minimum Number Of Lines Or Tokens

I have twice now found myself wanting to know whether a Javascript string has a minimum number of lines but not wanting to wastefully split the string to find out. Both times took

Solution 1:

The solution is to test for a series of token/non-token units rather than trying to test for the right count of delimited units or the right count of tokens. If the token is a delimiter and you want a minimum number of delimited units, require a count of token/non-token units equal to one less than the required number of units. As we'll see, this solution has surprising performance.

Minimum line count

This function checks for a minimum number of lines, where \n delimits lines rather than strictly ending them, allowing for an empty last line:

functionhasMinLineCount(text, minLineCount) {
    if (minLineCount <= 1)
        returntrue; // always 1+ lines, though perhaps emptyvar r = newRegExp('([^\n]*\n){' + (minLineCount-1) + '}');
    return r.test(text);
}

Alternatively, \n can be assumed to end lines, rather than purely delimit them, making an exception for a non-empty last line. For example, "apple\npear\n" would be two lines, while "apple\npear\ngrape" would be three. The following function counts lines in this manner:

functionhasMinLineCount(text, minLineCount) {
    var r = newRegExp('([^\n]*\n|[^\n]+$){' + minLineCount + '}');
    return r.test(text);
}

String delimiters and tokens

More generally, for any unit delimited by a string delimiter:

var _ = require('lodash');

functionhasMinUnitCount(text, minUnitCount, unitDelim) {
    if (minUnitCount <= 1)
        returntrue; // always 1+ units, though perhaps emptyvar escDelim = _.escapeRegExp(unitDelim);
    var r = newRegExp('(.*?'+ escDelim +'){' + (minUnitCount-1) + '}');
    return r.test(text);
}

We can also test for the presence of a minimum number of string tokens:

var _ = require('lodash');

functionhasMinTokenCount(text, minTokenCount, token) {
    var escToken = _.escapeRegExp(token);
    var r = newRegExp('(.*?'+ escToken +'){' + minTokenCount + '}');
    return r.test(text);
}

Regular expression delimiters and tokens

We can generalize further by allowing unit delimiters and tokens to include regular expression characters. Just make sure the delimiters or tokens can unambiguously occur back-to-back. Example regex delimiters include "<br */>" and "[|,]". These are strings, not RegExp objects.

functionhasMinUnitCount(text, minUnitCount, unitDelimRegexStr) {
    if (minUnitCount <= 1)
        returntrue; // always 1+ units, though perhaps emptyvar r = newRegExp(
              '(.*?'+ unitDelimRegexStr +'){' + (minUnitCount-1) + '}');
    return r.test(text);
}

functionhasMinTokenCount(text, minTokenCount, tokenRegexStr) {
    var r = newRegExp('(.*?'+ tokenRegexStr +'){' + minTokenCount + '}');
    return r.test(text);
}

Computational cost

The generic functions work because their regular expressions do non-greedy matching of characters (notice the .*?) up until the next delimiter or token. This is a computationally expensive process of looking ahead and backtracking, so these take a performance hit relative to more hardcoded expressions such as found in hasMinLineCount() above.

Let's revisit the initial question of whether we can outperform splitting a string with regex testing. Recall that our only objective is to test for a minimum number of lines. I used benchmark.js to test, and I assumed that we know that a plurality of lines is required. Here is the code:

varBenchmark = require('benchmark');
var suite = newBenchmark.Suite;

var line = "Go faster faster faster!\n";
var text = line.repeat(100);
varMIN_LINE_COUNT = 50;
var preBuiltBackingRegex = newRegExp('(.*?\n){'+ MIN_LINE_COUNT +'}');
var preBuiltNoBackRegex = newRegExp('([^\n]*\n){'+ MIN_LINE_COUNT +'}');

suite.add('split string', function() {
    if (text.split("\n").length >= MIN_LINE_COUNT)
        'has minimum lines';
})
.add('backtracking on-the-fly regex', function() {
    if (newRegExp('(.*?\n){'+ MIN_LINE_COUNT +'}').test(text))
        'has minimum lines';
})
.add('backtracking pre-built regex', function() {
    if (preBuiltBackingRegex.test(text))
        'has minimum lines';
})
.add('no-backtrack on-the-fly regex', function() {
    if (newRegExp('([^\n]*\n){'+ MIN_LINE_COUNT +'}').test(text))
        'has minimum lines';
})
.add('no-backtrack pre-built regex', function() {
    if (preBuiltNoBackRegex.test(text))
        'has minimum lines';
})
.on('cycle', function(event) {
    console.log(String(event.target));
})
.on('complete', function() {
    console.log('Fastest is ' + this.filter('fastest').map('name'));
})
.run({ 'async': true });

Here are the results of three runs:

split string x 263,260 ops/sec ±0.68% (85 runs sampled)
backtracking on-the-fly regex x 492,671 ops/sec ±1.01% (82 runs sampled)
backtracking pre-built regex x 607,033 ops/sec ±0.72% (87 runs sampled)
no-backtrack on-the-fly regex x 581,681 ops/sec ±0.77% (84 runs sampled)
no-backtrack pre-built regex x 723,075 ops/sec ±0.72% (89 runs sampled)
Fastest is no-backtrack pre-built regex

split string x 260,962 ops/sec ±0.82% (85 runs sampled)
backtracking on-the-fly regex x 502,410 ops/sec ±0.79% (84 runs sampled)
backtracking pre-built regex x 606,220 ops/sec ±0.67% (88 runs sampled)
no-backtrack on-the-fly regex x 578,193 ops/sec ±0.83% (86 runs sampled)
no-backtrack pre-built regex x 741,864 ops/sec ±0.68% (84 runs sampled)
Fastest is no-backtrack pre-built regex

split string x 262,266 ops/sec ±0.76% (87 runs sampled)
backtracking on-the-fly regex x 495,697 ops/sec ±0.82% (87 runs sampled)
backtracking pre-built regex x 608,178 ops/sec ±0.72% (88 runs sampled)
no-backtrack on-the-fly regex x 574,640 ops/sec ±0.92% (87 runs sampled)
no-backtrack pre-built regex x 739,629 ops/sec ±0.72% (86 runs sampled)
Fastest is no-backtrack pre-built regex

All of the regex tests are clearly faster than splitting the string to check the line count, even the backtracking ones. I think I'll do the regex testing.

Post a Comment for "Fastest Way To Test For A Minimum Number Of Lines Or Tokens"