Derivatives of Regular Expressions Explained Using Pac-Man | by Walter Schulze

A tutorial explaining the functional regular expression matching algorithm

images by author | ghost and cherries from Pac-Man

Eating red cherries gives you the ability to eat blue ghosts. The idea that derivatives can be used to create a regular expression matching algorithm is almost as ridiculous. Let me explain how this algorithm works and how it relates to Pac-Man.

In 1964, Brzozowski published the first paper about Derivatives of Regular expressions. This is by far one of my favourite algorithms. Using derivatives of regular expressions, we can implement an algorithm to do regular expression matching. This algorithm is very:

  • simple
  • functional
  • easily extendible with your own operators

It is also a great way to introduce you to other functional concepts, such as memoization, smart constructors, laziness, and fixed points. It can even be extended to handle the matching of context-free grammars, trees, and generic lists.

In this article, I will show you how we can match a string with a regular expression using only two pure functions and some Pac-Man analogies. If you prefer, you can watch the following video instead of reading the article as it covers the same material:

First, let’s do a quick review of regular expressions to make sure we are on the same page. The expression a(a|b)* matches a string that starts with an a, which is followed by any number of a’s and b’s.

  • The string ab will match a(a|b)*. We’ll indicate this with an edible blue ghost.
  • The string aabbba also matches a(a|b)* since it starts with an a and is followed by several a’s and b’s.
  • Next, the string ac does not match a(a|b)*, since the regex does not match any c’s and our regex does not do any substring matching. We’ll indicate this using a red ghost that is chasing the Pac-Man.
  • Finally, the string ba also does not match a(a|b)* because it doesn’t start with an a.

Now that our quick recap is done, onto the algorithm.

Before delving into the details, let’s get an overview of how this algorithm works. I have devised a weird game of Pac-Man where you only get to eat the ghosts if you eat the fruit in a sequence that matches the regex. Our Pac-Man represents the regex aba*. It has the following string of fruit to eat: an apple, then a banana, and then an apple: aba.

  1. When we start, the ghost is chasing us, and the regular expression we have left to match is aba*.
  2. We eat the first apple, and the regular expression that we now have left to match is ba*. The ghost is still chasing us since the fruit we have eaten so far, the apple, does not match the regex.
  3. Next, we eat the banana. The regex we then have left to match is a*. Now the ghost starts running away because, technically, ab already matches aba*.
  4. We can try to eat the ghost or eat another apple, in which case the regular expression we have left to match is still a*. The ghost is still running away since aba also matches the regular expression aba*.
Animation of Pac-Main eating an apple, a banana, and another apple

Let’s dissect what was happening here. The Pac-Man eating a fruit represents the derivative function. This means the derivative of aba* with respect to a is ba*. The derivative function takes a regular expression and a character and returns the regular expression that is left to match. The regular expression that is left to match after eating the apple is ba*. This also means that the derivative of ba* with respect to b is a*.

There is one more function at work here. The function checks whether the ghost is chasing the Pac-Man or whether the Pac-Man has already matched the regex and is chasing the ghost. This function is called the nullable function; it checks whether the regex that is left to match, matches the empty string. It can do that because if the regex that is left to match, matches the empty string, the fruit it has eaten must have already been enough to satisfy the regex.

nullable: matches the empty string
not nullable: does not match the empty string

That means we only need two functions to write the derivative matching algorithm:

  1. the derivative function
  2. the nullable function

When we have these, we can loop over the input string, eat all the fruit and check that the resulting regex matches the empty string. Here we have two implementations.

One in Golang for the imperative programmers:

The derivative matching algorithm in Golang

and another in Haskell for functional programmers:

The derivative matching algorithm in Haskell

These two functions are equivalent and just written in different styles of programming. In the Haskell code, the foldl also called fold left or reduce in other languages, does the work of the for loop for you. Also, in Haskell, we don’t need commas to pass parameters to functions; since function application is the most common operation in a functional programming language, we use space to delimit parameters.

Now, let’s delve deeper into how to implement the nullable and derivative functions.

But before we do that, I don’t know if you have ever wondered about the Pac-Man origin story. I contend there was no nuclear waste barrel that Pac-Man fell into, resulting in Pac-Man gaining the power to eat ghosts. The logic is much simpler.

Pac-Man is a fruit! When Pac-Man eats other fruit, Pac-Man is being a cannibal. So if you ever get chased by a ghost, you have to eat some human flesh, and the ghost should, at least temporarily, start running away from you. Now, I haven’t tried this myself, but the logic seems sound.

This explains why zombies are always chasing humans. As David Attenborough once said:

“The zombies that are chasing us, are themselves being chased by ghosts that we can’t see. After the zombies eat some of our human flesh, you will see them exhibiting the strange behaviour of chomping at the air, this is the zombie eating the ghost that was chasing it before.”

The implementation of the nullable and derivative functions requires us first to define the basic operators available in our regular expressions.

We can think of a regular expression as describing a set of strings.

  • This means that the empty set represents an operator that matches no strings.
  • The empty string represents a singleton set of a single string that only matches the empty string.
  • The character also represents a singleton set that only matches the single character a.
  • We can then combine these base regular expressions using operators such as: or, concatenation and Kleene star, where rand s represents the two regular expression expressions we are combining.

Now that we have defined our regular expression data type, we can look at how to implement our two functions that use this data type as input.

We can start with the nullable function. Let’s look at some examples and figure out which of these regular expressions matches the empty string to get insight into how this function is implemented.

  • a* matches the empty string since zero or more includes zero.
  • (a*|b) matches the empty string since the left side of the or matches the empty string.
  • ab does not match the empty string, since it only matches the string ab
  • ab* also does not match the empty string, since ab* requires a string that starts with an a
  • (a|b) does not match the empty string, since neither the left nor the right side of the or matches the empty string.
nullable examples

I hope these examples gave us insights into how the nullable function will be defined.

Here is the implementation of the nullable function. The left side represents values that are passed into the function, and the right side represents the implementation of the function in that case. The red ghosts represent false, and the blue ghosts represent true:

implementation of the nullable function
  • The empty set doesn’t match the empty string since it doesn’t match any string.
  • The empty string matches the empty string because it only matches the empty string.
  • The character a does not match the empty string because it only matches the character a.
  • If we have a logical or, we have to check both sides. If either matches the empty string, then the logical or matches the empty string.
  • For the concatenation of two regular expressions to match the empty string, they both have to match the empty string.
  • And finally, if we have zero or more of something, then that includes zero, which means it always matches the empty string.

Let’s run an example through our function. Given the regular expression (b | a*).

  1. Our top operator is the or this means we have to check the nullability of the left and right sides: b and a*.
  2. We check and see that the character b on the left side is not nullable: false.
  3. Then we check and see that the a* on the right side is nullable: true.
  4. Now we got back false and true and we can or them to get true.

So the expression (b|a*) is nullable. This makes sense since it will match the empty string. Before we move on to the derivative function, let’s do some more exercises:

Nullable exercises

Try to walk through the implementation and check if the following regular expressions are nullable. You can click on them to check your answer:

  1. a
  2. a*(b*|∅)
  3. εa
  4. ∅*
  5. (∅|b)*(abc|ε)

Before we look at the implementation of the function, let’s look at examples of a derivative. Here we are going to take the derivative of a few regular expressions, all with respect to the character a:

  • The regular expression that is left to match after a* has eaten an apple is still a*.
  • The derivative of ab* with respect to a is b*, since we have already matched the prefix a.
  • The derivative of (a|b)b with respect to a is b.
  • The derivative of b|(a*b) with respect to a is a*b. The left b didn’t match, so we could throw it away and the a was consumed by the zero or more a’s on the right.
  • Next, we have ab*, this one is slightly tricky. After it eats the apple, the regular expression that is left to match is b(ab)*. Since we have only matched the a, we are expecting to see at least one more b.

I hope these examples gave some intuition for how the derivative function will be defined. We need to define the derivative function that takes a regex, matches a single character, and returns a new regex. Here we define the derivative function with respect to the character a. The left side represents the input to the function, and the right side represents the function’s implementation.

  • The derivative of the empty set is always the empty set. There is no way of recovering since the empty set does not match anything.
  • The derivative of the empty string concerning any character is the empty set. It wasn’t expecting to match a character. It will only match the empty string.
  • The derivative of a single character to a similar character (in this case, the apple) is the empty string since after it has matched itself, all that is left to match is the empty string.
  • The derivative of a character with respect to a different character that is not equal, in this case, the banana, is the empty set since we have not matched the specific character.
  • The derivative of an or expression is the or of the derivatives. It simply pushes the problem down to its children.
  • The derivative of the concat expression has to consider whether it can skip over the first expression. It can only skip over the first expression if the first expression matches the empty string and is nullable. So the first thing we do is check this. Let’s think about the case where it cannot skip over the first expression when the expression r is not nullable. Then the derivative is the derivative of the first expression concatenated with the second expression s. If we can skip over the first regex, we have to consider an alternative that is only the derivative of the second expression. We can then or the two alternatives of skipping over r and not skipping over r and return that as a result.
  • Finally, we have the star operator. It matches an expression zero or more times. Since we are being passed a character, this is not the zero case. So we have to consider the one or more case. This means we have to take the derivative of the expression inside the star and concatenate it again with the zero or more expression.

As shown earlier, we can combine the derivative function with the nullable function to create a regular expression matching function. The concatenation and zero or more rules can take a while to grasp fully. So, let’s do some examples.

Derivative example 1

Let’s take the derivative of (ab)* with respect to a.

(ab)* is a zero or more expression, so we look to the zero or more rule. We see this requires taking the derivative of the expression inside the star.

This is the concatenation of a and b. So we check whether the left side is nullable, and the character a is not nullable. This means we can’t skip over it. We have to take the derivative of a with respect to a. But that is the empty string, so if we concatenate the empty string with the right side, which was b, we get b.

Now, we recurse back to the zero or more, remember we took the derivative of ab with respect to a and got back a b. Now we can concatenate that with (ab)* again and we get b(ab)*.

Derivative example 2

Let’s take the derivative of (a*ba) with respect to b.

  • a*is concatenated with ba so we take a look at the concatenation rule.
  • We check whether the left side a* is nullable, which it is. This means we can skip over it, which also means we have to create an or of two derivatives.
  • The left side ends up not matching, since a* does not match b.
  • Luckily we have an alternative the ba. The derivative of ba with respect to b is and a.

So, we end up just with an a.

I have skipped over some detail here. Consider it an exercise to check my work by walking through the function yourself.

Derivative exercises

Try to walk through the implementation and check what the derivatives of the following regular expressions with respect to b. You can click on them to check your answer:

  1. εb
  2. b*(b|c)
  3. a*(b|c)
  4. bεb
  5. ∅*b

I hope you now understand why eating red cherries gives you the ability to eat blue ghosts and how to implement a regular expression matcher using the derivative algorithm.

We have covered the basic working algorithm here, but there are many ways to make this algorithm even better with very small tweaks. We cheated and glossed over simplification rules in this post, using them without explaining them, which will become especially obvious if you walk through the exercises. We also haven’t discussed how you can use memoization to build an efficient automaton lazily.

We can also easily extend the algorithm to include new operators like, not, interleave and even support Context-Free Grammars. I will discuss some of these topics in the next article.

In the meantime, I would love to see your implementation of this algorithm in a programming language you are most comfortable with. Please send me a link in the comments.

Leave a Reply