Introduction to Regular Expressions using JavaScript - Part 2

Published on 10th of September 2008. Copyright Tavs Dokkedahl.

Working with character classes

If you have read part 1, you should be able to test strings for simple patterns, such as whether a string contains specific characters and where in the string those characters appear.

This part of the tutorial is concerned with what we call character classes.

A character class is a collection of characters. It provides a shorthand syntax for matching any one of several characters at a given position.

Consider the following example.
    // Match any string which starts with the letters 'ab'
    var rgx = /^ab/;
    // Will match
    rgx.test('abolishment'); // true
    // but not
    rgx.test('banana'); // false

The regex only matches strings which start with an 'a' followed immediately by a 'b', in that order.

To match a string which starts with either a or b we can create a character class. For this we use [ and ] (square brackets).

    // Match any string which starts with either 'a' or 'b'
    var rgx = /^[ab]/;
    // Will match
    rgx.test('abolishment'); // true
    rgx.test('banana'); // true

Now any string which starts with one of the letters inside the brackets is matched.
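To see the difference between the sequence 'ab' and the class [ab] side by side, the two regexes can be run against the same list of words. This is only a sketch: the helper name matching and the word list are made up for this example.

```javascript
// Hypothetical helper: returns the words which the regex matches.
function matching(rgx, words) {
  return words.filter(function (word) {
    return rgx.test(word);
  });
}

var words = ['abolishment', 'banana', 'cherry'];

// 'a' followed by 'b' at the start of the string
var sequence = matching(/^ab/, words);
// either 'a' or 'b' at the start of the string
var charClass = matching(/^[ab]/, words);

console.log(sequence);  // [ 'abolishment' ]
console.log(charClass); // [ 'abolishment', 'banana' ]
```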

To find any string which starts with a 'c' followed by either an 'a' or an 'o' we can write

    // Match any string which starts with 'c' followed
    // by either 'a' or 'o'
    var rgx = /^c[ao]/;
    // Will match
    rgx.test('cat'); // true
    rgx.test('corn'); // true
    // but not
    rgx.test('celtic'); // false

Character classes can be negated, that is, they can be reversed so the meaning becomes 'any character not inside the brackets'. To do this we place a caret (^) as the first character inside the class.

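    // Match any string which contains at least one character
    // that is not a 'g'
    var rgx = /[^g]/;
    // Will match
    rgx.test('cat'); // true
    rgx.test('goat'); // true ('o', 'a' and 't' are not 'g')
    // but not
    rgx.test('g'); // false
    rgx.test('ggg'); // false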
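Keep in mind that a negated class still has to match one character, so /[^g]/ only asks for some character other than 'g' to be present. As a sketch of how to check that a string contains no 'g' at all with what we know so far, one option is to negate the result of a plain /g/ test. The helper name containsNoG is made up for this example.

```javascript
// Succeeds as soon as ONE character other than 'g' is found.
var hasNonG = /[^g]/;

// Hypothetical helper: true only when there is no 'g' anywhere in the string.
function containsNoG(str) {
  return !/g/.test(str);
}

console.log(hasNonG.test('goat')); // true  ('o' is not a 'g')
console.log(hasNonG.test('ggg'));  // false (every character is a 'g')
console.log(containsNoG('goat'));  // false
console.log(containsNoG('cat'));   // true
```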

If we want to match strings which do not start with an 'f' we can write

    // Match any string whose first character is not an 'f'
    var rgx = /^[^f]/;
    // Will match
    rgx.test('afraid'); // true
    rgx.test('after'); // true
    // but not
    rgx.test('family'); // false
    rgx.test('fond'); // false
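The caret now has two meanings, and which one applies depends on where it is placed. The following sketch shows the caret outside a class, as the first character inside a class, and later inside a class, where it is just a literal '^'. The variable names are only for illustration.

```javascript
var anchor = /^f/;    // outside a class: beginning of string
var negated = /[^f]/; // first inside a class: negation
var literal = /[f^]/; // elsewhere inside a class: a literal '^'

console.log(anchor.test('fond'));  // true
console.log(anchor.test('after')); // false
console.log(negated.test('ff'));   // false (nothing but 'f')
console.log(literal.test('2^8'));  // true  (contains a '^')

// A negated class must still match one character,
// so the empty string is not matched.
console.log(/^[^f]/.test(''));     // false
```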

If we want to match a range of consecutive letters we can use a hyphen (-). This is practical, as writing classes like [abcdefghijklmn] quickly becomes tedious.

The same class can be written as [a-n], meaning any character from 'a' to 'n', both inclusive.

    // Match any string which contains a character in the range
    // a to n
    var rgx = /[a-n]/;
    // Will match
    rgx.test('anchor'); // true
    rgx.test('senses'); // true
    // but not
    rgx.test('sour'); // false
    rgx.test('yo'); // false

You can combine several ranges in one class as in

    // Match any string which contains a character in the range
    // f to i or n to t
    var rgx = /[f-in-t]/;
    // Will match
    rgx.test('nobody'); // true
    rgx.test('cent'); // true
    rgx.test('apple'); // true ('p' is in the range n to t)
    // but not
    rgx.test('black'); // false
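When ranges are combined it is easy to lose track of which letters are actually covered. One way to check, sketched here with made-up variable names, is to test every letter of the alphabet against the class and collect the ones that match.

```javascript
var rgx = /[f-in-t]/;
var alphabet = 'abcdefghijklmnopqrstuvwxyz';

// Keep only the letters which the class accepts.
var matched = alphabet.split('').filter(function (ch) {
  return rgx.test(ch);
}).join('');

console.log(matched); // 'fghinopqrst'
```

Note that 'p' is part of the result, so a word like 'apple' is matched by this class.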

Ranges are not limited to letters. You can have any Unicode character in a class and mix lowercase and uppercase. To allow the letters 'a', 'f', 'S' to 'Z' and all digits we can write

    // Match any string which contains an 'a' or an 'f' or an
    // uppercase character in the range 'S' to 'Z' or any digit
    var rgx = /[afS-Z0-9]/;
    // Will match
    rgx.test('first'); // true
    rgx.test('Sentry'); // true
    rgx.test('2nd'); // true
    // but not
    rgx.test('zen'); // false (only uppercase Z is matched)
    rgx.test('generic'); // false

You cannot mix classes and negated classes: a class is either negated or not. You also cannot nest classes.

Predefined character classes

JavaScript comes with a collection of predefined classes. These classes are denoted by a backslash (\) followed by a letter identifying the class.

The predefined classes cover the most commonly used ranges and are

Character  What the class will match
.          A single dot will match any character except line terminators such as the newline (\n) character.
\w         Any ASCII letter (either case), any digit and the underscore (_). \w is the same as [a-zA-Z0-9_]
\W         The opposite of \w. That is any character which is not in \w. The same as [^\w] or [^a-zA-Z0-9_]
\d         Any digit. The same as [0-9]
\D         Any character which is not a digit. The same as [^\d] or [^0-9]
\s         Any Unicode whitespace character (space, tab, newline and others such as carriage return).
\S         Any Unicode character which is not whitespace. The same as [^\s]
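The equivalences in the table can be checked with a small script. This is only a sketch: the helper name sameOnAscii is made up for this example, and it compares two regexes on the 128 ASCII characters only, so it says nothing about characters outside that range.

```javascript
// Hypothetical helper: true when both regexes agree on every
// ASCII character (character codes 0 to 127).
function sameOnAscii(a, b) {
  for (var code = 0; code < 128; code++) {
    var ch = String.fromCharCode(code);
    if (a.test(ch) !== b.test(ch)) {
      return false;
    }
  }
  return true;
}

console.log(sameOnAscii(/\w/, /[a-zA-Z0-9_]/));  // true
console.log(sameOnAscii(/\W/, /[^a-zA-Z0-9_]/)); // true
console.log(sameOnAscii(/\d/, /[0-9]/));         // true
console.log(sameOnAscii(/\D/, /[^0-9]/));        // true
console.log(sameOnAscii(/\S/, /[^\s]/));         // true

// \s covers more than space, tab and newline
// (for example carriage return), so this one differs.
console.log(sameOnAscii(/\s/, /[ \t\n]/));       // false
```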

You can use these classes to match a string which starts with an 'a' followed by a whitespace character followed by any character which is not a digit

    // Match any string which starts with an 'a', is followed by a
    // whitespace character and then by a character which is not a digit
    var rgx = /^a\s\D/;
    // Will match
    rgx.test('a song'); // true
    // but not
    rgx.test('apple'); // false
    rgx.test('a 45'); // false

The predefined classes can also be used in your own classes as in

    // Match any string which ends with a whitespace character or a digit
    var rgx = /[\s\d]$/;
    // Will match
    rgx.test('a song '); // true
    rgx.test('Mainstreet 56'); // true
    // but not
    rgx.test('apple'); // false
    rgx.test('a sentence'); // false
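Predefined classes can be placed inside a negated class as well. As a small sketch (the variable names are only for illustration), the following compares a class with its negated counterpart. Note that neither of them matches the empty string, since both need one character to match.

```javascript
var endsWithBlankOrDigit = /[\s\d]$/;
var endsWithOther = /[^\s\d]$/;

console.log(endsWithBlankOrDigit.test('Mainstreet 56')); // true
console.log(endsWithOther.test('Mainstreet 56'));        // false

console.log(endsWithBlankOrDigit.test('apple'));         // false
console.log(endsWithOther.test('apple'));                // true

// No last character at all: both tests fail.
console.log(endsWithBlankOrDigit.test(''));              // false
console.log(endsWithOther.test(''));                     // false
```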

Finally, an example of how to match any character using the dot

    // Match any string which starts with 'H' followed by any character
    // followed by an 'l'
    var rgx = /^H.l/;
    // Will match
    rgx.test('Helium'); // true
    rgx.test('Halo'); // true
    rgx.test('Holistic'); // true
    // but not
    rgx.test('Haven'); // false
    rgx.test('Hive'); // false
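The dot stands for exactly one character, and that character may be anything except a newline. A few edge cases, sketched with made-up test strings:

```javascript
var rgx = /^H.l/;

console.log(rgx.test('Hal'));  // true
console.log(rgx.test('H l'));  // true  (a space counts as a character)
console.log(rgx.test('H9l'));  // true  (so does a digit)
console.log(rgx.test('Hl'));   // false (the dot must match one character)
console.log(rgx.test('H\nl')); // false (the dot does not match a newline)
```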

User form example

So let's see just how much we can validate with what we have learned.

User form

Name
First and last name - letters only - max. 64 chars

Address
Streetname followed by number - max. 48 chars

Zip code
Only digits - max. 8 chars

City
Only letters - max. 32 chars

Phone
Phone number with possible prefix - max. 16 chars

Email
Lots of different characters - max. 64 chars.

I have filled in the form with the regexes we can make for now.

The above is not enough to validate what we want. None of the regexes can enforce the required number of characters. We don't know how to make a prefix for a phone number optional, and we certainly don't know how to validate the complex email format.

The regexes in the form translate to

Field    Regex                 Translation
Name     /[a-zA-Z]\s[a-zA-Z]/  A single letter in the range a-z (both lowercase and uppercase) followed by a single whitespace character followed by the same range again.
Address  /[a-zA-Z]\s\d/        A single letter in the range a-z (both lowercase and uppercase) followed by a single whitespace character followed by a single digit.
Zip      /\d/                  A single digit.
City     /[a-zA-Z]/            A single letter in the range a-z (both lowercase and uppercase).
Phone    /\d/                  A single digit.
Email    /[a-zA-Z]@[a-zA-Z]/   A single letter in the range a-z (both lowercase and uppercase) followed by a single @ followed by the same range again.
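To get a feeling for how loose these regexes are, they can be tried against a few values. This is only a sketch: the object fieldRegexes, the helper looksValid and the sample values are made up for this example, while the regexes themselves are the ones from the table above.

```javascript
// The regexes from the table, keyed by field name.
var fieldRegexes = {
  name: /[a-zA-Z]\s[a-zA-Z]/,
  address: /[a-zA-Z]\s\d/,
  zip: /\d/,
  city: /[a-zA-Z]/,
  phone: /\d/,
  email: /[a-zA-Z]@[a-zA-Z]/
};

// Hypothetical helper: true when the value matches the regex of the field.
function looksValid(field, value) {
  return fieldRegexes[field].test(value);
}

console.log(looksValid('name', 'Tavs Dokkedahl')); // true
console.log(looksValid('zip', '2100'));            // true
console.log(looksValid('phone', 'none'));          // false

// Too permissive: one matching character anywhere is enough.
console.log(looksValid('zip', 'abc1def'));         // true
console.log(looksValid('city', '!!x!!'));          // true
console.log(looksValid('email', 'a@b'));           // true
```

Since none of the regexes are anchored or count characters, a single matching fragment anywhere in the value is enough to pass.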

Next up we will learn how to count characters and look at matching repeating characters.

Resources

Declaring a regex
var rgx = /A/;
Characters with special meaning
^  Beginning of string / Negation
$  End of string
[  Start of character class
]  End of character class
.  Any character
Character classes
[ ... ]  Any character inside the brackets
[^ ... ] Any character not inside the brackets
Predefined character classes
.   Any character except for newline (\n)
\w  Any ASCII letter, digit or underscore. The same as [a-zA-Z0-9_]
\W  Any character which is not an ASCII letter, digit or underscore.
    The same as [^a-zA-Z0-9_]