## A tutorial explaining the functional regular expression matching algorithm

Eating red cherries gives you the ability to eat blue ghosts. The idea that derivatives can be used to create a regular expression matching algorithm is almost as ridiculous. Let me explain how this algorithm works and how it relates to Pac-Man.

In 1964, Brzozowski published the first paper about Derivatives of Regular expressions. This is by far one of my favourite algorithms. Using derivatives of regular expressions, we can implement an algorithm to do regular expression matching. This algorithm is very:

- simple
- functional
- easily extendible with your own operators

It is also a great way to introduce you to other functional concepts, such as memoization, smart constructors, laziness, and fixed points. It can even be extended to handle the matching of context-free grammars, trees, and generic lists.

In this article, I will show you how we can match a string with a regular expression using only two pure functions and some Pac-Man analogies. If you prefer, you can watch the following video instead of reading the article as it covers the same material:

Here is the implementation of the nullable function. The left side represents values that are passed into the function, and the right side represents the implementation of the function in that case. The red ghosts represent false, and the blue ghosts represent true:

- The empty set doesn’t match the empty string since it doesn’t match any string.
- The empty string matches the empty string because it only matches the empty string.
- The character
`a`

does not match the empty string because it only matches the character`a`

. - If we have a logical
`or`

, we have to check both sides. If either matches the empty string, then the logical`or`

matches the empty string. - For the
`concatenation`

of two regular expressions to match the empty string, they both have to match the empty string. - And finally, if we have
`zero or more`

of something, then that includes zero, which means it always matches the empty string.

Let’s run an example through our function. Given the regular expression `(b | a*)`

.

- Our top operator is the
`or`

this means we have to check the nullability of the left and right sides:`b`

and`a*`

. - We check and see that the character
`b`

on the left side is not nullable:`false`

. - Then we check and see that the
`a*`

on the right side is nullable:`true`

. - Now we got back
`false`

and`true`

and we can`or`

them to get`true`

.

So the expression `(b|a*)`

is nullable. This makes sense since it will match the empty string. Before we move on to the derivative function, let’s do some more exercises:

## Nullable exercises

Try to walk through the implementation and check if the following regular expressions are nullable. You can click on them to check your answer:

Before we look at the implementation of the function, let’s look at examples of a derivative. Here we are going to take the derivative of a few regular expressions, all with respect to the character `a`

:

- The regular expression that is left to match after
`a*`

has eaten an`a`

pple is still`a*`

. - The derivative of
`ab*`

with respect to`a`

is`b*`

, since we have already matched the prefix`a`

. - The derivative of
`(a|b)b`

with respect to`a`

is`b`

. - The derivative of
`b|(a*b)`

with respect to`a`

is`a*b`

. The left`b`

didn’t match, so we could throw it away and the`a`

was consumed by the`zero or more`

`a`

’s on the right. - Next, we have
`ab*`

, this one is slightly tricky. After it eats the apple, the regular expression that is left to match is`b(ab)*`

. Since we have only matched the`a`

, we are expecting to see at least one more`b`

.

I hope these examples gave some intuition for how the derivative function will be defined. We need to define the derivative function that takes a regex, matches a single character, and returns a new regex. Here we define the derivative function with respect to the character `a`

. The left side represents the input to the function, and the right side represents the function’s implementation.

- The derivative of the empty set is always the empty set. There is no way of recovering since the empty set does not match anything.
- The derivative of the empty string concerning any character is the empty set. It wasn’t expecting to match a character. It will only match the empty string.
- The derivative of a single character to a similar character (in this case, the
`a`

pple) is the empty string since after it has matched itself, all that is left to match is the empty string. - The derivative of a character with respect to a different character that is not equal, in this case, the
`b`

anana, is the empty set since we have not matched the specific character. - The derivative of an
`or`

expression is the`or`

of the derivatives. It simply pushes the problem down to its children. - The derivative of the
`concat`

expression has to consider whether it can skip over the first expression. It can only skip over the first expression if the first expression matches the empty string and is nullable. So the first thing we do is check this. Let’s think about the case where it cannot skip over the first expression when the expression`r`

is not nullable. Then the derivative is the derivative of the first expression`concatenated`

with the second expression`s`

. If we can skip over the first regex, we have to consider an alternative that is only the derivative of the second expression. We can then`or`

the two alternatives of skipping over`r`

and not skipping over`r`

and return that as a result. - Finally, we have the
`star`

operator. It matches an expression zero or more times. Since we are being passed a character, this is not the zero case. So we have to consider the`one or more`

case. This means we have to take the derivative of the expression inside the`star`

and`concatenate`

it again with the`zero or more`

expression.

As shown earlier, we can combine the derivative function with the nullable function to create a regular expression matching function. The `concatenation`

and `zero or more`

rules can take a while to grasp fully. So, let’s do some examples.

## Derivative example 1

Let’s take the derivative of `(ab)*`

with respect to `a`

.

`(ab)*`

is a `zero or more`

expression, so we look to the `zero or more`

rule. We see this requires taking the derivative of the expression inside the `star`

.

This is the `concatenation`

of `a`

and `b`

. So we check whether the left side is nullable, and the character `a`

is not nullable. This means we can’t skip over it. We have to take the derivative of `a`

with respect to `a`

. But that is the empty string, so if we `concatenate`

the empty string with the right side, which was `b`

, we get `b`

.

Now, we recurse back to the `zero or more`

, remember we took the derivative of `ab`

with respect to `a`

and got back a `b`

. Now we can concatenate that with `(ab)*`

again and we get `b(ab)*`

.

## Derivative example 2

Let’s take the derivative of `(a*ba)`

with respect to `b`

.

`a*`

is concatenated with`ba`

so we take a look at the concatenation rule.- We check whether the left side
`a*`

is nullable, which it is. This means we can skip over it, which also means we have to create an`or`

of two derivatives. - The left side ends up not matching, since
`a*`

does not match`b`

. - Luckily we have an alternative the
`ba`

. The derivative of`ba`

with respect to`b`

is and`a`

.

So, we end up just with an `a`

.

I have skipped over some detail here. Consider it an exercise to check my work by walking through the function yourself.

## Derivative exercises

Try to walk through the implementation and check what the derivatives of the following regular expressions with respect to `b`

. You can click on them to check your answer:

I hope you now understand why eating red cherries gives you the ability to eat blue ghosts and how to implement a regular expression matcher using the derivative algorithm.

We have covered the basic working algorithm here, but there are many ways to make this algorithm even better with very small tweaks. We cheated and glossed over simplification rules in this post, using them without explaining them, which will become especially obvious if you walk through the exercises. We also haven’t discussed how you can use memoization to build an efficient automaton lazily.

We can also easily extend the algorithm to include new operators like, `not`

, `interleave`

and even support Context-Free Grammars. I will discuss some of these topics in the next article.

In the meantime, I would love to see your implementation of this algorithm in a programming language you are most comfortable with. Please send me a link in the comments.