Catastrophic Backtracking

Catastrophic Backtracking

Catastrophic Backtracking



At first sight, some regular expressions seem simple but may take too much time to execute. That can even make the engine of JavaScript to “hang”. In such cases, the browser suggests killing the script and reloading the page. But, it’s not a good idea. for JavaScript engine, it can become a fragility.

In this chapter, we will show you ways to overcome such situations.

An Example of a Regexp

For a better understanding of the issue, let’s start at examples.

Imagine having a string and intending to check whether it involves words with an optional \s? space after each of them. You can use the ^(\w+\s?)*$ regexp, specifying 0 or more words, like this:

let regexp = /^(\w+\s?)*$/;

console.log(regexp.test("Welcome to Web")); // true

console.log(regexp.test("Bad characters: $@#")); // false

This code works, and the result is correct. But, it takes too much time on certain things.

Running the code below, there won’t be anything, as the JavaScript engine will “hang”. After a while, it will suggest reloading the page, so you should be careful:

let regexp = /^(\w+\s?)*$/;

let str = "The input string, that takes a lot of time or to hang when makes this regex";

// taking a lot of time

console.log(regexp.test(str));

Not all regular expression engines can handle it.

Words and Strings

So, the reason of the hanging is that a word may be represented as one \w+ or multiple:

(input)

(inpu)(t)

(inp)(u)(t)

(in)(p)(ut)

...

As the string ends with an exclamation sign !, it’s obvious that there is no match. But, in the end, the regular expression considers a \w character or a space \s. But the engine is not aware of that.

As there are too many combinations to search for, it takes too much time.

Solutions

There exist two primary solutions to the problem.

The first solution is to lower the number of possible combinations. For that purpose, you can just rewrite the regexp, using its equivalent, like in the example below:

let regexp = /^(\w+\s)*\w*$/;

let str = "The input string, that takes a lot of time or to hang when makes this regex";

console.log(regexp.test(str)); // false

As you can see in the example above, the * goes after \w+\s instead of \w+\s?.

Representing one word of the string with multiple successive \w+ becomes impossible.

The (\w+\s?)* pattern can match the word string as two \w+, like this:

\w+\w+

string

In this case, there may be \w+\s or \w+\s\w+\s, and not \w+\w+.

Preventing Backtracking

The second solution is forbidding to backtrack for the quantifier.

The engine of the regexp tries different combinations that can be wrong for a human.

For instance, it’s obvious that in the (\d+)*$ regexp + couldn’t backtrack.

Replacing one \d+ with two separate \d+\d+ will not change anything:

\d+........

(123456789)!

\d+...\d+....

(1234)(56789)!

To prevent problems, modern engines use greedy quantifiers that don’t backtrack.

Also, there are so-called “atomic capturing groups” used to disable backtracking in parentheses. However, they are not supported by JavaScript. But, another way is efficiently used by JavaScript.

Lookahead

Lookahead is used for preventing backtracking.

Let’s interpret how it works:

  1. Lookahead ?= searches for the longest word \w+, beginning at the current position.
  2. The parentheses’ contents with ?=... is not remembered by the engine. So, wrapping into parentheses will make the engine memorize the contents.
  3. It is referenced in the pattern as \1 by all of us.

So, when there is a word \w+, you should match it as \1.

The reason is that a lookahead detects a word \w+ as a whole and you capture it in the pattern using \1. Essentially, a possessive plus + quantifier is implemented, capturing the whole world and not a part of it.

For example, in the word JavaScript, it can not only match, leaving out Java to correspond to the best part of the pattern.

Let’s see the following example:

console.log("JavaScript".match(/\w+Script/)); // JavaScript

console.log("JavaScript".match(/(?=(\w+))\1Script/)); // null

The first option of the example above, the \w+ captures the word JavaScript totally, then + backtracks it character by character, trying to match the rest of the pattern, till \w+ matches Java. In the second option, (?=(\w+)) searches for and detects JavaScript, which is included in the pattern as a whole by \1. So, no way remains for finding Script after it.

If you put a more complicated regexp to (?=(\w+))\1 rather than \w, you should prevent backtracking for + after that.

In the example below, lookahead is used for preventing backtracking:

let regexp = /^((?=(\w+))\2\s?)*$/;

console.log(regexp.test("Welcome to W3Docs")); // true

let str = "The input string, that takes a lot of time or to hang when makes this regex";

console.log(regexp.test(str)); // false, works and fast!

In the example above, instead of \1 is used \2 as additional outer parentheses are there. To prevent a mess with numbers, you can name the parentheses, like here:

// parentheses are named ?<word>, referenced as \k<word>

let regexp = /^((?=(?<word>\w+))\k<word>\s?)*$/;

let str = "The input string, that takes a lot of time or to hang when makes this regex";

console.log(regexp.test(str)); // false

console.log(regexp.test("A correct string")); // true

Summary

In this chapter, we covered a common problem in JavaScript, known as “catastrophic backtracking”.

Here we overviewed the two main solutions to that problem: lowering the possible combinations count and preventing backtracking.

The most common and efficient way to prevent backtracking is using lookahead.

Reactions

Post a Comment

0 Comments

close