Solving the Dogpile Effect

How to cache in high-volume scenarios


We have discussed cache consistency before, and at that point we mentioned that good consistency is achievable if we implement the read-side cache correctly. To further improve consistency, a more complex solution is needed, for example, a write-through cache or a write-behind cache.

Today, we are going to discuss another common caching scenario: the dogpile effect.

The dogpile effect means that when a system is under a huge amount of traffic, whenever a cache entry becomes invalid, whether through eviction or expiration, the impact is huge.

For example, if a cache entry is being accessed by 100 concurrent requests, then after the entry expires, all 100 requests will hit the backend system directly, which is a serious challenge for the backend.

So, let’s take a look at the methods available to deal with the dogpile effect. There are three general approaches:

  1. Warm up the cache
  2. Extend the cache time
  3. Exclusive lock

Each of these three options has its advantages and disadvantages, and in fact, they are already widely used.

A question: do you know which of these approaches race_condition_ttl in the Ruby on Rails cache store corresponds to?

Before we talk about this approach, let’s review the read path of the read-side cache.

  1. First, all read requests read from the cache.
  2. If the cache read misses,
  3. then read from the database and write the value back to the cache (a minimal sketch follows below).
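To make the read path concrete, here is a minimal sketch of this cache-aside pattern in Python, assuming Redis via the redis-py client. The key naming and `load_user_from_db` are hypothetical placeholders, not anything from the original article.

```python
import json
import redis

r = redis.Redis()          # assumes a local Redis instance
CACHE_TTL = 300            # 5 minutes, matching the article's example

def load_user_from_db(user_id):
    # Hypothetical placeholder for the real (expensive) database query.
    return {"id": user_id, "name": "example"}

def read_user(user_id):
    key = f"user:{user_id}"
    cached = r.get(key)                              # 1. read from the cache first
    if cached is not None:
        return json.loads(cached)
    value = load_user_from_db(user_id)               # 2. cache miss: read from the database
    r.set(key, json.dumps(value), ex=CACHE_TTL)      # 3. write the value back with a TTL
    return value
```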

The problem occurs when the cache expires.

In a high-volume system with many concurrent read requests, this results in a flood of database access. If we could keep the cache alive the whole time, couldn’t we solve the dogpile effect?

Thus, the approach is to replace the cache TTL with a background thread that periodically refreshes all cache entries. For example, if the cache TTL is 5 minutes, refresh all entries every 5 minutes, so the read path never faces an expired cache.
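As a rough illustration (not the author's exact setup), here is a minimal sketch of such a background refresher. It reuses `r`, `json`, and `load_user_from_db` from the sketch above; `CRITICAL_USER_IDS` is a hypothetical list of keys worth keeping warm.

```python
import threading
import time

REFRESH_INTERVAL = 300          # refresh every 5 minutes, matching the old TTL
CRITICAL_USER_IDS = [1, 2, 3]   # hypothetical set of "critical" keys to keep warm

def refresh_critical_caches():
    # Refresh only the designated critical entries, not every cache entry.
    for user_id in CRITICAL_USER_IDS:
        value = load_user_from_db(user_id)
        r.set(f"user:{user_id}", json.dumps(value))  # no TTL: the refresher keeps it fresh

def refresher_loop():
    while True:
        refresh_critical_caches()
        time.sleep(REFRESH_INTERVAL)

# Run the refresher in the background so reads never see an expired entry.
threading.Thread(target=refresher_loop, daemon=True).start()
```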

But such an approach has its drawbacks.

  1. This is very space-inefficient if applied to all cache entries.
  2. In some corner cases, a cached entry can still become invalid, for example when it is evicted by the cache itself.

Although this is a simple and brute-force method, I have to say that it works great in some situations. If we have a critical cache that has to handle a lot of traffic and is expensive to update, then keeping that cache fresh is the most efficient approach.

As long as we avoid updating all cache entries, it won’t take up a lot of space and is less likely to trigger corner cases.

The warm-up cache may work well for a specific critical cache, but if no critical cache is defined at all, the benefits of warming up cannot be realized.

Therefore, the second approach is suitable for general purposes.

When reading the cache, if the entry is found to have expired, slightly extend its lifetime and start updating it. Subsequent concurrent read requests will then keep reading the extended entry instead of all hitting the backend database at the same time.

Assuming that the cache TTL is 5 minutes, we set each cache to have an extension time of one minute, as shown in the Gantt chart below.

In the interval 0-5, a read finds the original value in the cache. If a read occurs during the interval 5-6, the cache has already expired, but its lifetime is extended, so the original value is still available. At the same time, the first reader in the interval 5-6 is responsible for updating the cache, so after time point 6 the cache holds the new value.

Let’s use a sequence diagram to represent a scenario with two concurrent requests.

Suppose the cache has already expired. When A reads, it finds that the cache has timed out, so it first writes the original value back to the cache with an extended lifetime, then performs a regular read to retrieve the value from the database, and finally writes the new value back to the cache.

Meanwhile, B finds that the cache has not expired when it reads, because the cache has been extended, so it can still get the original value.

With this approach, only one of N concurrent requests needs to reach the backend, and the remaining N - 1 requests can still get the value from the cache.
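Here is a minimal sketch of this extension idea, again assuming Redis and reusing `r` from the earlier sketch. The logical expiry is stored next to the value, while the physical Redis TTL is set a bit longer to leave room for the grace period; the names and numbers are illustrative only, not Rails' actual implementation.

```python
import json
import time

LOGICAL_TTL = 300   # 5 minutes, as in the article's example
EXTENSION = 60      # 1-minute grace period during which stale reads are allowed

def fetch_with_extension(key, loader):
    raw = r.get(key)
    if raw is not None:
        entry = json.loads(raw)
        if time.time() < entry["expires_at"]:
            return entry["value"]                    # still logically fresh
        # Logically expired: push the stale value forward by EXTENSION seconds
        # so concurrent readers keep being served, then refresh it ourselves.
        entry["expires_at"] = time.time() + EXTENSION
        r.set(key, json.dumps(entry), ex=EXTENSION)
    value = loader()                                 # expensive backend read
    fresh = {"value": value, "expires_at": time.time() + LOGICAL_TTL}
    r.set(key, json.dumps(fresh), ex=LOGICAL_TTL + EXTENSION)
    return value
```

Note that the read-check-extend sequence here is not atomic, which is exactly the weakness discussed further below.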

In fact, Ruby on Rails’ race_condition_ttl is exactly this implementation. Lines 855 and 856 in the link above are the operations that extend the cache time.

This approach looks like an efficient way to handle high-volume traffic, with only one request accessing the backend, doesn’t it?

The answer is, no, not really.

This approach clearly breaks down when facing a high-concurrency scenario. Let’s continue to describe the problem with a sequence diagram.

It is the same A and B as before, but this time A and B arrive very close to each other. As you can see from the sequence diagram, B’s read has already happened by the time A tries to write the original value back to the cache, so B also believes it is the first reader.

When N requests all come in at the same time, such an implementation still cannot solve the dogpile effect.

Extending the cache time seems to solve most of the problems, but it is still not enough in high-concurrency systems. Therefore, we need a way to serialize the high-concurrency scenario. Earlier, we introduced the concept of an exclusive lock.

In this approach, we use an exclusive lock to prevent multiple concurrent requests from passing through the cache to the backend.

First, an exclusive lock must be acquired before the cache can be updated. Only the requester that acquires the lock is eligible to update the cache, i.e., to access the database, while the rest that fail to acquire the lock must wait for one of the following two conditions:

  1. The cache has been updated.
  2. The lock can be acquired.

A waiter that eventually acquires the lock must also verify whether the cache has already been updated, so as to avoid further duplicate database access.
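A minimal sketch of this locking scheme, assuming Redis and reusing `r` from the earlier sketches: the lock is an ordinary key written atomically with SET NX plus an expiry, waiters poll for either the refreshed cache or the lock, and `loader` stands in for the expensive backend read. It is an illustration of the idea, not a production-grade lock (the release step, for instance, is not atomic).

```python
import json
import time
import uuid

LOCK_TTL = 10       # seconds; bounds how long a crashed lock holder blocks others
WAIT_STEP = 0.05    # polling interval while waiting for the cache or the lock

def fetch_with_lock(key, loader, ttl=300):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                     # fast path: cache hit

    lock_key = f"lock:{key}"
    token = str(uuid.uuid4())                         # identifies our lock ownership
    while True:
        # Try to become the single updater (SET ... NX EX is atomic in Redis).
        if r.set(lock_key, token, nx=True, ex=LOCK_TTL):
            try:
                # Re-check: the cache may have been filled while we were waiting.
                cached = r.get(key)
                if cached is not None:
                    return json.loads(cached)
                value = loader()                      # only one request reaches the backend
                r.set(key, json.dumps(value), ex=ttl)
                return value
            finally:
                if r.get(lock_key) == token.encode(): # release only our own lock
                    r.delete(lock_key)
        # Lock not acquired: wait, then check whether the cache has been updated.
        time.sleep(WAIT_STEP)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
```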

Of course, this approach has an obvious drawback. Serializing a concurrent process will reduce concurrency significantly. Furthermore, waiting is an additional overhead that not only consumes resources but also affects performance.

It is worth mentioning that we often say Redis is not fully reliable, so in this scenario, is it necessary to use Redlock to further improve the reliability of the lock?

It depends.

In general, no. Even though a single Redis lock is not perfectly reliable, the occasional lock leak in this scenario is not a problem. This is not a scenario that requires strong consistency; the worst case is merely a little extra database access.

To address a common problem with caching, the dogpile effect, we’ve reviewed three common approaches.

The scenario in which warming up the cache applies is the “critical cache.” Once we can narrow down the scope of the cache, keeping it fresh is the most intuitive and easiest approach.

Extending the cache time is a general-purpose approach that can effectively deal with a variety of high-volume scenarios. By providing a time buffer, the cache is allowed to keep serving until it is updated by “someone.”

The exclusive lock is an approach specialized for high concurrency. To prevent concurrent requests from hitting the backend system, they are converted into sequential requests through a serialization mechanism.

Each of these three approaches has its own advantages and disadvantages, but in fact it is possible to combine extending the cache time with an exclusive lock into a complete solution. I will explain the implementation details in code in the next article.

Let’s call it a day.
