Introduction to Regular Expressions using JavaScript - Part 3

Published on 10th of September 2008. Copyright Tavs Dokkedahl.

Counting and repeating characters

This part of the tutorial will teach you to count characters and use repeating patterns.

In part 1 and part 2 we have learned about matching single characters, groups of characters and a little about character position in the string (start and end).

With repetition and counting you should be able to see just how powerful regexes can be.

Repetition

There are several ways to specify how many of a kind you would like to match. An asterisk (*) specifies zero or more, a question mark (?) specifies zero or one and a plus sign (+) specifies one or more.

    // Match the word 'color' or 'colour'
    var rgx = /colou?r/;
    // Will match
    rgx.test('color');  // true
    rgx.test('colour'); // true

This way we can match both the British and the American spelling of the word. The question mark specifies that the 'u' is optional: there can be zero or one 'u'. The *, ? and + always apply to the character (or character class) immediately before them.
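To see the difference between the three, here is a small sketch that runs the same inputs through each of them. The patterns and variable names are made up for the illustration, and the ^ and $ anchors from part 2 are used so that the whole string has to fit.

```javascript
// Compare the three shorthand quantifiers on the same inputs.
var optional = /^ab?c$/;   // zero or one 'b'
var anyNumber = /^ab*c$/;  // zero or more 'b'
var atLeastOne = /^ab+c$/; // one or more 'b'

var samples = ['ac', 'abc', 'abbc'];
var results = samples.map(function (s) {
  return [s, optional.test(s), anyNumber.test(s), atLeastOne.test(s)];
});

console.log(results);
// [ [ 'ac', true, true, false ],
//   [ 'abc', true, true, true ],
//   [ 'abbc', false, true, true ] ]
```

Only the 'b' is affected in each case, since it is the character right before the quantifier.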

Now if we want to match a string like '15 colours', but with any number in place of 15, we can do

    // Match at least 1 digit followed by a single whitespace character
    // followed by the word 'color' or 'colour'
    var rgx = /\d+\scolou?r/;
    // Will match
    rgx.test('13 color');       // true
    rgx.test('1235234 colour'); // true
    // but not
    rgx.test('34   color');     // false (too many spaces)
    rgx.test('99colour');       // false (space is missing)

This reads as the character class \d (any digit 0 to 9) one or more times, followed by a single whitespace character (\s, which matches a space but also a tab or a newline), followed by either spelling of color.

If you are matching the above but don't care how many spaces there are between the number and the word color you can do this

    // Match at least 1 digit followed by any number of spaces
    // (or no space) followed by the word 'color' or 'colour'
    var rgx = /\d+\s*colou?r/;
    // Will match
    rgx.test('13 color');       // true
    rgx.test('1235234 colour'); // true
    rgx.test('34   color');     // true
    rgx.test('99colour');       // true
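The two patterns can be compared side by side with a short sketch like the one below. The variable names strict and lenient are only chosen for this illustration.

```javascript
var strict = /\d+\scolou?r/;   // exactly one whitespace character
var lenient = /\d+\s*colou?r/; // zero or more whitespace characters

var inputs = ['13 color', '1235234 colour', '34   color', '99colour'];
var comparison = inputs.map(function (input) {
  return {
    input: input,
    strict: strict.test(input),
    lenient: lenient.test(input)
  };
});

console.log(comparison);
// strict:  true, true, false, false
// lenient: true, true, true,  true
```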

The asterisk is often useful for leading and trailing spaces. If you want to make sure a form field only contains digits but allow the user to have spaces before and after the digits you can do the following. The ^ and $ anchors from part 2 make sure nothing else is allowed in the field.

    // Match any number of spaces followed by at least one digit
    // followed by any number of spaces, and nothing else
    var rgx = /^\s*\d+\s*$/;
    // Will match
    rgx.test('13');     // true
    rgx.test('  12');   // true
    rgx.test('34   ');  // true
    rgx.test('  99  '); // true
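A minimal sketch of how such a check could be used on a field value follows. The helper name parseDigitsField is hypothetical, and the ^ and $ anchors from part 2 are included so that the whole value has to fit the pattern.

```javascript
// Hypothetical helper: returns the number typed into a field, or null if
// the field holds anything other than digits with optional padding.
function parseDigitsField(value) {
  var digitsWithPadding = /^\s*\d+\s*$/;
  if (!digitsWithPadding.test(value)) {
    return null;
  }
  // parseInt skips leading whitespace and stops at the trailing whitespace.
  return parseInt(value, 10);
}

console.log(parseDigitsField('  99  ')); // 99
console.log(parseDigitsField('abc 12')); // null
console.log(parseDigitsField('12 34'));  // null
```

Returning null for a bad value is just one possible design. A real form would probably show a message next to the field instead.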

The *, ? and + are really just shorthand for specific cases of the general repetition syntax. Counting the number of repetitions is done using { and } (curly brackets). You can use these in three ways

Syntax  Number of matches
{x,y}   Match at least x and at most y times
{x,}    Match at least x times
{x}     Match exactly x times

From the table we get that

Shorthand  Equivalent
?          Is the same as {0,1}
*          Is the same as {0,}
+          Is the same as {1,}
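The equivalence can be checked with a quick sketch. The patterns and sample strings below are picked for the illustration only. Each pair holds a shorthand version and the same pattern written with curly brackets.

```javascript
// Each pair: [shorthand form, curly bracket form]
var pairs = [
  [/^colou?r$/, /^colou{0,1}r$/],
  [/^\d+\s*$/, /^\d{1,}\s{0,}$/]
];
var samples = ['color', 'colour', 'colouur', '42', '42   ', ''];

// true only if both forms agree on every sample string
var allAgree = pairs.every(function (pair) {
  return samples.every(function (s) {
    return pair[0].test(s) === pair[1].test(s);
  });
});

console.log(allAgree); // true
```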

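To match a string of exactly 8 digits we write the following. The ^ and $ anchors keep the pattern from matching 8 digits inside a longer number.

    // Match exactly 8 digits and nothing else
    var rgx = /^\d{8}$/;
    // Will match
    rgx.test('12345678'); // true
    rgx.test('45678912'); // true
    // but not
    rgx.test('34');             // false
    rgx.test('99123637687687'); // false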
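A pattern without anchors matches anywhere in the string, so it is worth checking what a counted pattern actually picks up. The sketch below uses match() to show the matched text. The variable names are only chosen for this illustration.

```javascript
var unanchored = /\d{8}/;
var anchored = /^\d{8}$/;
var longNumber = '99123637687687';

// Without anchors the first 8 digits of the longer number are matched.
var found = longNumber.match(unanchored);
console.log(found[0]); // '99123637'

// With anchors the whole string must consist of exactly 8 digits.
console.log(anchored.test(longNumber)); // false
console.log(anchored.test('12345678')); // true
```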

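To match a string of at most 32 characters in the range a-z

    // Match at most 32 characters in the range a-z and nothing else
    var rgx = /^[a-z]{0,32}$/;
    // Will match
    rgx.test('smalltown'); // true
    rgx.test('');          // true
    // but not
    rgx.test('The small town'); // false (uppercase letter and spaces)
    rgx.test('asentencewithfartoomanycharactersinit'); // false (37 letters)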

Matching a string that consists of between 7 and 9 digits, followed by 1 or more spaces, followed by any character exactly 4 times is done with

    var rgx = /^\d{7,9}\s+.{4}$/;
    // Will match
    rgx.test('12345678   Beta'); // true
    rgx.test('4567891 City');    // true
    // but not
    rgx.test('34 snakes');       // false (too few digits)
    rgx.test('4567891 crimson'); // false (more than 4 characters after the space)

To test whether the 4th character is a 'u'

    // Match a string which contains a 'u' as the 4th character
    var rgx = /^.{3}u/;
    // Will match
    rgx.test('Columbia'); // true
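As a sanity check, the pattern can be compared with a plain charAt() lookup. This is only a sketch, and the sample words are made up for the illustration.

```javascript
var fourthIsU = /^.{3}u/;
var words = ['Columbia', 'Colombia', 'abu', 'you'];

var checks = words.map(function (word) {
  return {
    word: word,
    regex: fourthIsU.test(word),
    charAt: word.charAt(3) === 'u' // index 3 is the 4th character
  };
});

console.log(checks);
// Only 'Columbia' gives true, and both columns agree for every word.

// The dot does not match a newline, so here the two approaches differ:
console.log(fourthIsU.test('a\nbu')); // false
```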

User form example

Now that we can count we can make our form validation considerably stricter

User form

Name
First and last name - letters only - max. 64 chars

Address
Streetname followed by number - max. 48 chars

Zip code
Only digits - max. 8 chars

City
Only letters - max. 32 chars

Phone
Phone number with possible prefix - max. 16 chars

Email
Lots of different characters - max. 64 chars.

The regexes are not complete yet but they are certainly better.

Names are unlikely to be shorter than 2 letters and town names are probably at least 4 letters long. There is room for more than one space between first and last name etc. We still can't provide an optional prefix for the phone number, as prefixes usually start with a plus sign and the plus sign has a special meaning in a regex. The same is the case with the email address, which requires a single dot.

The regexes in the form translate to

Field    Regex                          Translation
Name     /[a-zA-Z]{2,}\s+[a-zA-Z]{2,}/  2 or more letters in the range a-z (both lowercase and uppercase) followed by 1 or more spaces followed by the same range again.
Address  /[a-zA-Z]{2,}\s+\d+/           2 or more letters in the range a-z (both lowercase and uppercase) followed by 1 or more spaces followed by 1 or more digits.
Zip      /\d{4,8}/                      Between 4 and 8 digits.
City     /[a-zA-Z]{4,32}/               Between 4 and 32 letters in the range a-z (both lowercase and uppercase).
Phone    /\d{8,16}/                     Between 8 and 16 digits.
Email    /[a-zA-Z]+@[a-zA-Z]+/          1 or more letters in the range a-z (both lowercase and uppercase) followed by a single @ followed by the same range again.
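The table could be wired into a form check roughly like this. It is only a sketch: the names fieldRules and findInvalidFields and the sample values are hypothetical, the patterns are copied from the table, and the maximum lengths are taken from the form description above. Since the patterns have no anchors, the length is checked separately.

```javascript
// Patterns from the table, maximum lengths from the form description.
var fieldRules = {
  name:    { pattern: /[a-zA-Z]{2,}\s+[a-zA-Z]{2,}/, maxLength: 64 },
  address: { pattern: /[a-zA-Z]{2,}\s+\d+/,          maxLength: 48 },
  zip:     { pattern: /\d{4,8}/,                     maxLength: 8 },
  city:    { pattern: /[a-zA-Z]{4,32}/,              maxLength: 32 },
  phone:   { pattern: /\d{8,16}/,                    maxLength: 16 },
  email:   { pattern: /[a-zA-Z]+@[a-zA-Z]+/,         maxLength: 64 }
};

// Hypothetical helper: returns the names of the fields that fail their rule.
function findInvalidFields(form) {
  return Object.keys(fieldRules).filter(function (field) {
    var value = String(form[field] || '');
    var rule = fieldRules[field];
    return value.length > rule.maxLength || !rule.pattern.test(value);
  });
}

var invalid = findInvalidFields({
  name: 'Ada Lovelace',
  address: 'Mainstreet 12',
  zip: '12', // too short: fewer than 4 digits
  city: 'Copenhagen',
  phone: '12345678',
  email: 'ada@example'
});

console.log(invalid); // [ 'zip' ]
```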

Moving on we will work with special characters and grouping.

Resources

Declaring a regex
var rgx = /A/;
Characters with special meaning
^  Beginning of string / Negation
$  End of string
[  Start of character class
]  End of character class
.  Any character
?  0 or 1 times
*  0 or more times
+  1 or more times
{  Start of repetition
}  End of repetition
Character classes
[ ... ]  Any character inside the brackets
[^ ... ] Any character not inside the brackets
Predefined character classes
.   Any character except for newline (\n)
\w  Any word character. The same as [a-zA-Z0-9_]
\W  Any character which is not a word character.
    The same as [^a-zA-Z0-9_]
  
Repetition
{x,y} At least x and at most y times
{x,}  At least x times
{x}   Exactly x times