JavaScript Sets and ranges [...]


Let’s get deeper into the details of regular expressions. In this chapter, we will show you how to use sets and ranges in JavaScript.

Putting several characters or character classes inside square brackets […] tells the regexp to search for any one of them.

For example, [lam] means any one of the three characters 'l', 'a', or 'm'. This is called a “set”. Sets can be used in a regexp alongside regular characters, like this:

// find [w] (in either case, because of the i flag), and then "eb"

console.log("Welcome to Web".match(/[w]eb/gi)); // "Web"

Although a set may contain several characters, it corresponds to exactly one character in the match.

So, there are no matches in the example below:

// find "W", then [e or l], then "come"

console.log("Welcome".match(/W[el]come/)); // null, no matches

The pattern looks for W, then one of the letters [el], and, finally, come.

So, there could be a match for Wecome or Wlcome, but not for Welcome.

Ranges

Square brackets can also include the so-called character ranges.

For example, [a-m] is a character in range from “a” to “m”, and [0-7] is a digit from “0” to “7”.

Let’s see an example where “x” is followed by two characters, each a digit or a letter from A to F.

console.log("Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g)); // xAF

So, in the example above, [0-9A-F] includes two ranges: it looks for a character that is either a digit from “0” to “9” or a letter from “A” to “F”.

If you also want to match lowercase letters, you can either add the a-f range to the set or add the i flag.
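As a small sketch of the two options just mentioned, both patterns below find the hexadecimal pairs regardless of letter case. The sample string and variable names are illustrative assumptions, not part of the original example. Note that the i flag also makes the literal x case-insensitive, which the explicit a-f range does not.

```javascript
// Sample input chosen for illustration; it mixes upper- and lowercase hex digits.
const sample = "Colors: 0xAF and 0xbe";

// Option 1: list the lowercase range explicitly inside the set.
const withRange = sample.match(/x[0-9A-Fa-f][0-9A-Fa-f]/g);

// Option 2: keep the original set and add the i (case-insensitive) flag.
const withFlag = sample.match(/x[0-9A-F][0-9A-F]/gi);

console.log(withRange); // [ 'xAF', 'xbe' ]
console.log(withFlag);  // [ 'xAF', 'xbe' ]
```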

Inside […], you can also use character classes.

For example, to search for a wordy character \w or a hyphen -, the set is [\w-]. You can also combine several classes, such as [\s\d], which means “a whitespace character or a digit”.
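A minimal sketch of both sets in use might look like this. The sample strings are assumptions made for illustration, and the + quantifier (one or more repetitions) is used only to collect neighbouring characters into one match.

```javascript
// Hypothetical sample string, used only for illustration.
const slug = "my-page_2 title";

// [\w-] matches one wordy character or a hyphen; + repeats it.
const slugParts = slug.match(/[\w-]+/g);
console.log(slugParts); // [ 'my-page_2', 'title' ]

// [\s\d] matches one whitespace character or one digit.
const spaceOrDigit = "a1 b2".match(/[\s\d]/g);
console.log(spaceOrDigit); // [ '1', ' ', '2' ]
```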

Multilanguage \w

As \w is a shorthand for [a-zA-Z0-9_], it can’t find Cyrillic letters, Chinese characters, and so on.

A more universal pattern, one that searches for wordy characters in any language, can be written quite easily with Unicode properties:

[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]

Let’s break it down. Like \w, it covers several kinds of characters, each identified by a Unicode property:

  • Alphabetic (Alpha) - for letters.
  • Mark (M) - for accents.
  • Decimal_Number (Nd) - for digits.
  • Connector_Punctuation (Pc) - for the underscore and similar characters.
  • Join_Control (Join_C) - for the two special codes 200c and 200d, used in ligatures, for example in Arabic.

Here is how it looks in practice:

let regexp = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;

let str = `Welcome 你好 123`;

// finds all the digits and letters:

console.log(str.match(regexp));
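To make the difference from \w visible, the sketch below runs both patterns over the same string. The sample text and variable names are assumptions made for illustration. The + quantifier groups neighbouring characters into words.

```javascript
// Illustrative input with Cyrillic, Chinese, and Latin parts.
const text = "Привет 你好 hi_1";

// \w only covers [a-zA-Z0-9_], so only the Latin part is found.
const ascii = text.match(/\w+/g);

// The Unicode-aware set; the u flag is required for \p{…} to work.
const wordy = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu;
const unicode = text.match(wordy);

console.log(ascii);   // [ 'hi_1' ]
console.log(unicode); // [ 'Привет', '你好', 'hi_1' ]
```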

Excluding Ranges

Besides normal ranges, there are also “excluding” ranges, which look like [^…]. They are marked by a caret character ^ at the start and match any character except the given ones.

The example below searches for any character except letters, digits, and spaces:

console.log("web@gmail.com".match(/[^\d\sA-Z]/gi)); // @ and .
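A common use of an excluding set is stripping unwanted characters. The following is a sketch under the assumption of a made-up phone number string. It also shows that [^\d] and [^\s] behave like the inverse classes \D and \S.

```javascript
// Hypothetical input: a phone number with formatting characters.
const phone = "+1 (555) 010-9999";

// [^\d] matches any character that is NOT a digit (same as \D).
const digitsOnly = phone.replace(/[^\d]/g, "");
console.log(digitsOnly); // "15550109999"

// [^\s] matches any non-whitespace character (same as \S).
const nonSpaces = "a b".match(/[^\s]/g);
console.log(nonSpaces); // [ 'a', 'b' ]
```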

Escaping in […]

As a rule, when a special character needs to be found literally, it should be escaped, like \.. If a backslash is needed, \\ is used. Inside square brackets, however, most special characters can be used without escaping:

// No need to escape
let regexp = /[*().^+]/g;
console.log("1 + 2 * 3".match(regexp)); // Matches +, *
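The snippet below sketches the usual escaping rules for sets: a hyphen placed first or last and a caret that is not first are literal, while a closing bracket and a backslash still need a backslash in front of them. The test strings and variable names are illustrative assumptions.

```javascript
// A hyphen placed first (or last) in the set is a literal hyphen, not a range.
const hyphenFirst = "1 + 2 - 3".match(/[-+]/g);
console.log(hyphenFirst); // [ '+', '-' ]

// A caret that is not the first character is a literal caret.
const caretLater = "2^3".match(/[3^]/g);
console.log(caretLater); // [ '^', '3' ]

// A closing bracket and a backslash still need escaping inside a set.
const bracketAndSlash = "a]b\\c".match(/[\]\\]/g);
console.log(bracketAndSlash); // [ ']', '\\' ]

// Escaping "just in case" does no harm: the result is the same as without it.
const escapedAnyway = "1 + 2 * 3".match(/[\*\(\)\.\^\+]/g);
console.log(escapedAnyway); // [ '+', '*' ]
```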

Ranges and the u Flag

If a set contains surrogate pairs, the u flag is required for it to work correctly.

For example, let’s search for [𝒳𝒴] in the string 𝒳:

console.log('𝒳'.match(/[𝒳𝒴]/)); // shows a strange character, like [?]

The result is incorrect because, by default, regular expressions don’t recognize surrogate pairs.

The regexp engine thinks that [𝒳𝒴] contains four characters, not two:

1. the left half of 𝒳 (1).
2. the right half of 𝒳 (2).
3. the left half of 𝒴 (3).
4. the right half of 𝒴 (4).

Their codes can be seen as follows:

for (let i = 0; i < '𝒳𝒴'.length; i++) {
  console.log('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
}

So, the example above finds and shows the left half of 𝒳.

Adding the u flag makes it work properly:

console.log('𝒳'.match(/[𝒳𝒴]/u)); // 𝒳

A similar problem occurs when searching for a range like [𝒳-𝒴].

Forgetting to add the u flag leads to an error:

console.log('𝒳'.match(/[𝒳-𝒴]/)); // Error: Invalid regular expression

The error happens because, without the u flag, surrogate pairs are treated as two characters each. The range is then read as running from the right half of 𝒳 (56499) to the left half of 𝒴 (55349), and since its end is smaller than its start, the pattern is invalid.

With the u flag, the pattern works correctly:

// search for characters from 𝒳 to 𝒵
console.log('𝒴'.match(/[𝒳-𝒵]/u)); // 𝒴
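To see why the u flag matters, it can help to compare how a string holding a surrogate pair is counted and matched. The sketch below uses 𝒳 (U+1D4B3) as an assumed sample character that lies outside the Basic Multilingual Plane. The variable names are illustrative.

```javascript
// 𝒳 (U+1D4B3) is stored in a JavaScript string as a surrogate pair.
const x = "𝒳";

const codeUnits = x.length;       // counts UTF-16 code units
const codePoints = [...x].length; // the string iterator walks code points
console.log(codeUnits, codePoints); // 2 1

// Without u, a set matches one code unit, so it cannot cover the whole string.
const withoutU = /^[𝒳]$/.test(x);
// With u, the set matches one whole code point.
const withU = /^[𝒳]$/u.test(x);

console.log(withoutU); // false
console.log(withU);    // true
```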
