
Knowing just enough regex to be useful

A 10-minute, easy-to-remember primer on regexes so that you can use them.

5 min read


Regex. Regular expressions.

You may have heard of them, and maybe you have dabbled with them a bit. But they always seem too hard to get right, don’t they?

Let me try to make it simple.

Recognizing a regex

There are many ways to denote the start and end of a regular expression. But a very common one is to enclose them with the / symbol. Like so:

/bought milk/

This way, it’s easy to see where a regex starts and ends.

The simplest regex

Okay, so if we have a sentence (the string), such as:

Amelia bought milk and a milkshake from the store down the road

And a regex like the one we just looked at:

/bought milk/

It will match.

In this way, the regex is being used a bit like a search. The regex simply matches the string.
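If you want to try this in code, here is a minimal sketch. It assumes Python and its built-in re module, which this article doesn't prescribe; Python writes the pattern as a string instead of wrapping it in / symbols.

```python
import re

sentence = "Amelia bought milk and a milkshake from the store down the road"

# re.search scans the whole string and returns the first match, or None.
match = re.search(r"bought milk", sentence)

found = match is not None                     # True
position = match.span() if match else None    # (7, 18): start and end index
print(found, position)
```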

This regex is very simple, so it runs with the defaults. By default, a regex:

  • is case-sensitive
  • searches anywhere in the string
  • finds only the first match
  • matches as much as possible

Many times, the defaults work just fine.

Simple modifiers

Controlling location

Now, to make the regex a little more precise, we can add modifiers to it.

Looking at the start of the string

To get the regex to look only from the start of the string, we give it a caret, or ^ symbol, like so:

/^bought milk/

This will no longer match our string “Amelia bought milk and a milkshake from the store down the road”.

However, this will match:

/^Amelia bought milk/

Because that matches the start of the string.

Looking at the end of the string

To get the regex to look only at the end of the string, we give it a dollar sign, or $ symbol, like so:

/bought milk$/

This will no longer match our string “Amelia bought milk and a milkshake from the store down the road”.

However, this will match:

/down the road$/

Because that matches the end of the string.

Looking at both the start and end of the string

And they can be used in combination, like so:

/^bought milk$/

And you’re certainly getting the picture now. It won’t match.

However, this will match:

/^Amelia bought milk and a milkshake from the store down the road$/

Because that matches both the start and the end of the string. Which is really the entire string.
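The anchor cases above can be checked with a short sketch. It assumes Python's built-in re module, which is my choice for illustration and not something the article depends on.

```python
import re

sentence = "Amelia bought milk and a milkshake from the store down the road"

# ^ pins the pattern to the start of the string, $ pins it to the end.
starts_with_bought = re.search(r"^bought milk", sentence) is not None         # False
starts_with_amelia = re.search(r"^Amelia bought milk", sentence) is not None  # True
ends_with_milk = re.search(r"bought milk$", sentence) is not None             # False
ends_with_road = re.search(r"down the road$", sentence) is not None           # True

# With both anchors, the pattern has to account for the entire string.
only_bought_milk = re.search(r"^bought milk$", sentence) is not None          # False
```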

Flags

The flags we are going to talk about don’t change the text being searched for.

They sit outside the search expression and affect the behaviour of the regex engine itself.

Therefore, they are written in a separate place in the regex.

Case-sensitivity flag

Regexes are, by default, case-sensitive.

If we want to ignore case, we can give the regex the i flag (for “insensitive”).

Flags go at the end of the expression, like so:

/BOUGHT mIlk/i

This will match our string, because the regex is now case insensitive.
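How you pass a flag depends on the language. As a sketch, assuming Python's built-in re module (where flags are passed as an argument and not written after a closing /), re.IGNORECASE plays the role of the i flag:

```python
import re

sentence = "Amelia bought milk and a milkshake from the store down the road"

# Default behaviour: case-sensitive, so the odd capitalisation fails to match.
sensitive = re.search(r"BOUGHT mIlk", sentence)

# re.IGNORECASE is Python's counterpart to the i flag.
insensitive = re.search(r"BOUGHT mIlk", sentence, flags=re.IGNORECASE)

print(sensitive)            # None
print(insensitive.group())  # bought milk
```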

Global flag

Remember that regexes, by default, only find the first match?

If we want to find all matches, we give it the g flag (for “global”), like so:

/milk/g

This will match both occurrences of the pattern milk in our phrase:

Amelia bought milk and a milkshake from the store down the road
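Not every language has a g flag. Python, for example, doesn't; there you pick a function that returns every match instead. A small sketch, assuming Python's built-in re module:

```python
import re

sentence = "Amelia bought milk and a milkshake from the store down the road"

# re.search stops at the first match, like a regex without the g flag.
first_only = re.search(r"milk", sentence).start()                 # 14

# re.findall and re.finditer return every match, like a regex with the g flag.
all_matches = re.findall(r"milk", sentence)                       # ['milk', 'milk']
positions = [m.start() for m in re.finditer(r"milk", sentence)]   # [14, 25]
```

The second position is the “milk” inside “milkshake”.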

Escape characters

Regex expressions treat some characters as special.

You’ve already seen the ^ and the $ characters, and how these characters give special meaning to the regex expression.

So what happens if our string itself has one of those characters? We escape it using a \ symbol.

Take for example:

David spent $1,000 on a new iPhone

Suppose we want to match the $1,000 part of the string above.

To do so, we need to escape the special $ sign in our expression, and tell the regex that we are literally looking for the $, rather than the end of the string.

Our regex then looks like this:

/\$1,000/

With the \ symbol, the $ is now treated as a plain, ordinary character.

These are the common characters that have special meanings in a regex, and which need to be escaped with \ if you want to match them literally:

[ \ ^ $ . | ? * + ( )

Just as another example:

David spent $1,000 on a [new] iPhone

To match the square brackets and the word “new”, our regex needs to look like:

/\[new]/
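Here is a sketch of escaping in practice, assuming Python's built-in re module. The closing ] is escaped too in this version; that is optional in many engines, but harmless. re.escape is a Python helper that adds the backslashes for you when the text to match comes from somewhere else.

```python
import re

sentence = "David spent $1,000 on a [new] iPhone"

# Unescaped, $ means "end of string", so nothing can follow it and match.
unescaped = re.search(r"$1,000", sentence)          # None

# Escaped, $ and [ are treated as plain characters.
price = re.search(r"\$1,000", sentence).group()     # '$1,000'
bracketed = re.search(r"\[new\]", sentence).group() # '[new]'

# re.escape builds the escaped pattern from literal text.
auto_price = re.search(re.escape("$1,000"), sentence).group()
```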

Options, using “or”

So far we’ve been matching stuff that is literally in our string, and nothing else.

Which isn’t very useful.

What if we can match variants? That’s where the fun starts.

Let’s say we have a bunch of different strings:

String 1:

Amelia bought milk and a milkshake from the store down the road

String 2:

Jonathan bought milk and donuts from the store across the road

Suppose we want to find “<whatever> from the store”.

One way is to use the “or” character, |.

We could write the regex as:

/(a milkshake|donuts) from the store/

And our regex would match both strings.

Notice also that we grouped the words “a milkshake” and “donuts” together with parentheses, ( and ). This tells the regex engine that we want either “a milkshake” or “donuts”, followed by “ from the store”.
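To see why the parentheses matter, here is a small sketch, assuming Python's built-in re module. Without them, the | splits the entire pattern in two.

```python
import re

string_1 = "Amelia bought milk and a milkshake from the store down the road"
string_2 = "Jonathan bought milk and donuts from the store across the road"

pattern = r"(a milkshake|donuts) from the store"

# .group() with no argument returns the full text that matched.
matched_text = [re.search(pattern, s).group() for s in (string_1, string_2)]
# ['a milkshake from the store', 'donuts from the store']

# Without parentheses this reads: "a milkshake" OR "donuts from the store".
ungrouped = re.search(r"a milkshake|donuts from the store", string_1).group()
# 'a milkshake'
```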

Character classes

But what if we want to match any English word before the words “from the store”, in a more general way?

We use character classes.

Character classes are enclosed using square brackets, [ and ].

Since we’re looking for English words, they must be made of letters. We can tell the regex engine to look for any single lowercase letter using the character class [a-z].

Notice that [a-z] matches just a single lowercase letter. That’s not useful if we want to match a whole word, for instance.

So character classes are often used together with what we call an occurrence indicator.

Never mind the name, it just tells us how many times the thing before it should match.

There are 4 very common occurrence indicators:

  • +: one or more
  • *: zero or more
  • ?: zero or one
  • {n}: exactly n times

So let’s combine.

If we want to match a lowercase word, we would use:

[a-z]+: one or more lowercase letters (that’s a lowercase word!)

If we want to match a word without caring about uppercase and lowercase, we would use:

[A-Za-z]+: one or more letters (that’s a word regardless of case!)

If we want to match numbers, we would use:

[0-9]+: one or more digits (that’s a whole number!)

If we want to match an alphanumeric “word” without caring about uppercase and lowercase, we would use:

[A-Za-z0-9]+: any alphanumeric “word”
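To get a feel for these combinations, here is a sketch that runs each one over a made-up sentence. Both the sentence and the choice of Python's built-in re module are my assumptions for illustration.

```python
import re

# Hypothetical sample text, not from the article.
text = "Order 66 shipped to Amelia on day 7"

lowercase_runs = re.findall(r"[a-z]+", text)
# ['rder', 'shipped', 'to', 'melia', 'on', 'day'] (capital letters are skipped)

words = re.findall(r"[A-Za-z]+", text)
# ['Order', 'shipped', 'to', 'Amelia', 'on', 'day']

numbers = re.findall(r"[0-9]+", text)
# ['66', '7']

alphanumeric = re.findall(r"[A-Za-z0-9]+", text)
# ['Order', '66', 'shipped', 'to', 'Amelia', 'on', 'day', '7']

# {2} asks for exactly two digits in a row, so the lone 7 is left out.
two_digits = re.findall(r"[0-9]{2}", text)
# ['66']
```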

Some examples

So, for our strings:

String 1:

Amelia bought milk and a milkshake from the store down the road

String 2:

Jonathan bought milk and donuts from the store across the road

How do we match “<whatever> from the store”?

Simply:

/[a-z]+ from the store/
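Running that regex over both strings shows exactly what it picks up. A sketch, assuming Python's built-in re module:

```python
import re

string_1 = "Amelia bought milk and a milkshake from the store down the road"
string_2 = "Jonathan bought milk and donuts from the store across the road"

pattern = r"[a-z]+ from the store"

match_1 = re.search(pattern, string_1).group()  # 'milkshake from the store'
match_2 = re.search(pattern, string_2).group()  # 'donuts from the store'
```

Note that [a-z]+ stops at the space, so only the single word right before “from the store” is included.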

Suppose we have a credit card number that can be written with or without dashes:

Credit card number 1:

1234-4568-8765-4321

Credit card number 2:

1234456887654321

We can create a regex that matches both forms, like so:

/[0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4}/

The {4} matches a digit exactly 4 times, and the [-]? makes the dash optional (? means zero or one).
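A quick way to convince yourself is to test the regex against a few inputs. This sketch assumes Python's built-in re module; re.fullmatch only succeeds when the whole input fits the pattern, as if it were wrapped in ^ and $.

```python
import re

card = r"[0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4}"

with_dashes = re.fullmatch(card, "1234-4568-8765-4321") is not None   # True
without_dashes = re.fullmatch(card, "1234456887654321") is not None   # True

# Too few digits: the pattern insists on four groups of four.
too_short = re.fullmatch(card, "1234-4568") is not None               # False

# Each dash is optional on its own, so a mixed form also matches.
mixed = re.fullmatch(card, "1234-45688765-4321") is not None          # True
```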


Capturing text

Okay. So far we’ve been treating our regex like a search engine.

We’ve been asking it, basically, “does this, or does this not match?”

But that’s far from the power of regexes.

A regex can be used to pull information out. And that’s incredibly useful.

Suppose we have all of the following sentences:

My credit card number is 1234456887654321.
David said 4356-7813-5465-7891 is his credit card number.
Mary disclosed that her card number is 6578454565461324.
Jane told John to use 5345-1232-1212-3456 to buy something.

(all the above card numbers are, obviously, fake and random)

Now, using what we already know, we can easily craft a regex that will match the credit card numbers in the 4 sentences above:

/[0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4}/

But we don’t have any tools to pull out those credit card numbers.

Groups

But, if we enclose our regex in parentheses, we form a group that can be captured:

/([0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4})/

Different regex libraries give you different ways to access the captured text, which you can find in the documentation of your regex library (or of the application that gives you regex powers).

Commonly, though, the first group is available as $1 or \1.
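As one concrete case, here is a sketch assuming Python's built-in re module. There, a match object exposes the first group as .group(1), and \1 refers to it inside a replacement string.

```python
import re

sentences = [
    "My credit card number is 1234456887654321.",
    "David said 4356-7813-5465-7891 is his credit card number.",
    "Mary disclosed that her card number is 6578454565461324.",
    "Jane told John to use 5345-1232-1212-3456 to buy something.",
]

card = r"([0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4})"

# .group(1) is the text captured by the first pair of parentheses.
numbers = [re.search(card, s).group(1) for s in sentences]
# ['1234456887654321', '4356-7813-5465-7891',
#  '6578454565461324', '5345-1232-1212-3456']

# In a replacement, \1 stands for whatever group 1 captured.
tagged = re.sub(card, r"<\1>", sentences[0])
# 'My credit card number is <1234456887654321>.'
```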

We can combine capturing groups together with a more typical regex expression.

If we have these strings:

My credit card number is 1234456887654321.
Mary disclosed that her card number is 6578-4545-6546-1324.

We can ensure that the credit card number is preceded by the word “is”, like so:

/is ([0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4})/
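The useful part is that the whole match and the captured group are no longer the same thing. A sketch, assuming Python's built-in re module:

```python
import re

pattern = r"is ([0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4}[-]?[0-9]{4})"

m = re.search(pattern, "Mary disclosed that her card number is 6578-4545-6546-1324.")

whole_match = m.group(0)   # 'is 6578-4545-6546-1324' (includes the word "is")
card_number = m.group(1)   # '6578-4545-6546-1324'    (only the parenthesised part)

# Here the number comes before "is", so the pattern does not match at all.
no_match = re.search(pattern, "David said 4356-7813-5465-7891 is his credit card number.")
# None
```

So the text outside the parentheses acts as context that must be present, while the group picks out only the part you want to keep.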


Summary

And that sums up the very basics of regexes.

You should now have enough tools to do simple regex matches, to automate some searching or extraction tasks, or to understand the core parts of a regex when you see one.

If you’d like to experiment, or test your regexes, I have found the following site to be very useful:

Have fun!

Eugene Ching Founder of Qavar, an AI and cybersecurity company. We use machine learning to bring insights into your business, and defend you against digital threats.
