How to Scan a 23 GB DynamoDB Table in One Minute | by Vlad Holubiev | Nov, 2022

a 50x speed improvement

While running DynamoDB workloads in production with over 500 GB of data for a few years now, I’ve accumulated a few lessons learned that I’m sharing. I wrote about one of them earlier: how to speed up long DynamoDB queries by 2x.

AWS DynamoDB is a great database. It is fast, reliable and infinitely scalable. However, the query language is quite limited. The flexibility of your queries depends heavily on the hash key design, which you can read about here.

While you can go a long way with CRUD operations on a small subset (<1,000) of items, sometimes you find yourself in a situation where you need to query a large table by an attribute which is not part of your hash key.

Imagine a table containing user information, where the hash key is the user’s ID.

There are two primary recommended APIs for getting data out of DynamoDB: GetItem and Query. With GetItem, you are limited to fetching a single record by its key; with Query, you can fetch multiple records that share the same hash key.
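As a quick illustration (a sketch using the AWS SDK for JavaScript v3 DocumentClient; the table name, key name and values here are made up, not taken from the article), the two calls look roughly like this:

const {DynamoDBClient} = require('@aws-sdk/client-dynamodb');
const {DynamoDBDocumentClient, GetCommand, QueryCommand} = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

(async () => {
  // GetItem: fetch exactly one record by its full primary key
  const {Item} = await ddb.send(
    new GetCommand({TableName: 'users', Key: {id: 'user-123'}})
  );

  // Query: fetch all records that share the same hash key
  const {Items} = await ddb.send(
    new QueryCommand({
      TableName: 'users',
      KeyConditionExpression: 'id = :id',
      ExpressionAttributeValues: {':id': 'user-123'},
    })
  );

  console.log(Item, Items);
})();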

But what if you want to query items by an attribute that is not part of the key, for example by a senderId attribute?

AWS DynamoDB provides an API for that: Scan. However, it comes with a number of caveats:

  • Scanning is slow. Very slow. It’s basically a brute-force mechanism that reads your entire table, record by record
  • Scanning is expensive. With DynamoDB, you pay for the read request units you consume, and by scanning the entire table you’ll pay about $1.25 for a 10 GB table
  • Scanning is cumbersome. The Scan API is paginated: each request returns up to 1 MB of data, so to scan a 10 GB table you have to page through it sequentially, making 10,000 requests (see the sketch below)
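To make the pagination caveat concrete, here is a minimal sketch of a plain sequential scan with the AWS SDK for JavaScript v3 (the table name is an assumption). Each iteration fetches at most 1 MB and passes the last evaluated key back in to get the next page:

const {DynamoDBClient, ScanCommand} = require('@aws-sdk/client-dynamodb');

const client = new DynamoDBClient({});

(async () => {
  let lastEvaluatedKey;
  let scanned = 0;

  // Keep requesting the next 1 MB page until LastEvaluatedKey comes back empty
  do {
    const page = await client.send(
      new ScanCommand({
        TableName: 'files', // hypothetical table name
        ExclusiveStartKey: lastEvaluatedKey,
      })
    );
    scanned += page.Count ?? 0;
    lastEvaluatedKey = page.LastEvaluatedKey;
  } while (lastEvaluatedKey);

  console.log(`scanned ${scanned} items sequentially`);
})();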

So if the scan API is so bad, why would you ever want to use it? Sure, you shouldn’t be using it in production code, but there are other use cases:

  • Migrating data from a DynamoDB table to another table or another datastore
  • Finding a small subset of data from the entire table for debugging purposes

In fact, a lot of AWS features use the DynamoDB Scan API under the hood, for example DynamoDB export to S3 or AWS Athena federated queries.

Let’s say you evaluated all the cons and still decided to use Scan. We can’t make it cheap, but we can make it faster and easier to use! Let’s see how.

To solve the first problem, speed, you can use a parallel scan. I am quoting the AWS documentation here:

By default, the Scan operation processes data sequentially. Amazon DynamoDB returns data to the application in 1 MB increments, and an application performs additional Scan operations to retrieve the next 1 MB of data.

A Scan operation can logically divide a table into multiple segments, with multiple application workers scanning the segments in parallel.

There’s an excellent illustration of this in the AWS docs.

The idea is to issue the Scan API requests in parallel, providing the Segment and TotalSegments parameters:

  • Worker #1: Segment: 0, TotalSegments: 10
  • Worker #2: Segment: 1, TotalSegments: 10
  • Worker #3: Segment: 2, TotalSegments: 10
  • …and so on

By doing this, your table is logically split into multiple segments, and you paginate through the results of each segment just like you would with a regular Scan API call.
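For illustration, a bare-bones version of this could look like the sketch below (AWS SDK for JavaScript v3; the table name and segment count are assumptions). Each worker paginates its own segment independently, and all workers run in parallel:

const {DynamoDBClient, ScanCommand} = require('@aws-sdk/client-dynamodb');

const client = new DynamoDBClient({});

// Scans a single segment of the table, paginating until that segment is exhausted
async function scanSegment(segment, totalSegments) {
  const items = [];
  let lastEvaluatedKey;

  do {
    const page = await client.send(
      new ScanCommand({
        TableName: 'files', // hypothetical table name
        Segment: segment,
        TotalSegments: totalSegments,
        ExclusiveStartKey: lastEvaluatedKey,
      })
    );
    items.push(...(page.Items ?? []));
    lastEvaluatedKey = page.LastEvaluatedKey;
  } while (lastEvaluatedKey);

  return items;
}

// Fire off all segment workers in parallel within one process
(async () => {
  const totalSegments = 10;
  const segments = Array.from({length: totalSegments}, (_, i) => i);
  const results = await Promise.all(segments.map(s => scanSegment(s, totalSegments)));
  console.log(results.flat().length, 'items scanned in parallel');
})();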

And one of the cool things is that you can request as many as 1,000,000 segments. So if your table is 10 GB and you split it into 1,000 segments, each worker only needs to scan 10 MB of data, which equates to only 10 requests.

Each worker can be a dedicated thread in your process, a separate server, or, even better, a Lambda function.

So we’ve found a way to overcome the speed problem: with a parallel scan, we can make scanning very fast. Now let’s take a look at the ease-of-use aspect of the problem.

Keeping track of all the segments and paging through their results every time you need to do this is not a fun coding problem. Fortunately, I have created a Node.js library that solves it.

GitHub repo: dynamodb-parallel-scan

This library encapsulates several things you would otherwise need to handle yourself when implementing a parallel scan:

  • Monitoring scan segments and controlling request concurrency
  • Getting a stream of items as they are scanned, versus collecting them all after the scan finishes
  • A backpressure mechanism to avoid hitting the process memory limit when fetching large volumes of data; it pauses scan requests until you have processed the data that has already been fetched

Here’s sample code that uses the library to scan a table with 1,000 parallel requests:

const {parallelScan} = require('@shelf/dynamodb-parallel-scan');

(async () => {
  const items = await parallelScan(
    {
      TableName: 'files',
      FilterExpression: 'attribute_exists(#fileSize)',
      ExpressionAttributeNames: {
        '#fileSize': 'fileSize',
      },
      ProjectionExpression: 'fileSize',
    },
    {concurrency: 1000}
  );

  console.log(items);
})();

This approach is great when you know the data you are going to fetch fits in memory, and you are fine with getting it only after the full table scan has finished.

The other method fetches data as a stream. It is useful when the volume is too large to hold in memory, so you want to consume items as soon as a chunk of them has accumulated.

You can control the concurrency, chunk size, and high watermark, which drive the backpressure mechanism that keeps too much data from piling up in memory. The library pauses scanning until you are ready to iterate over the next chunk.

const {parallelScanAsStream} = require('@shelf/dynamodb-parallel-scan');

(async () => {
  const stream = await parallelScanAsStream(
    {
      TableName: 'files',
      FilterExpression: 'attribute_exists(#fileSize)',
      ExpressionAttributeNames: {
        '#fileSize': 'fileSize',
      },
      ProjectionExpression: 'fileSize',
    },
    {concurrency: 1000, chunkSize: 10000, highWaterMark: 10000}
  );

  for await (const items of stream) {
    console.log(items); // 10k items here
    // do some async processing to offload data from memory
    // and move on to consuming the next chunk
    // scanning will be paused for that time
  }
})();

Well, let’s look at the benchmarks now.

Here is the piece of code I used to run the benchmarks. The benchmark runs in a single process on a single machine.

I am going to run the script with the DEBUG=ddb-parallel-scan environment variable set to see the library’s debug output.

I have a DynamoDB table which is 23 GB in size and contains 36,000,000 records. Each record is an average of 600 bytes.

The first time, I run the script with concurrency: 1; the next time, with concurrency: 250.
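In spirit, the benchmark boils down to timing a full pass over the table at a given concurrency, roughly like the sketch below (it assumes the same parallelScanAsStream API shown earlier and a hypothetical table name):

const {parallelScanAsStream} = require('@shelf/dynamodb-parallel-scan');

// Times a full pass over the table at the given concurrency level
async function benchmark(concurrency) {
  const start = Date.now();
  let scanned = 0;

  const stream = await parallelScanAsStream(
    {TableName: 'files'}, // hypothetical table name
    {concurrency, chunkSize: 10000, highWaterMark: 10000}
  );

  for await (const items of stream) {
    scanned += items.length;
  }

  console.log(`concurrency=${concurrency}: scanned ${scanned} items in ${(Date.now() - start) / 1000}s`);
}

(async () => {
  await benchmark(1);   // sequential baseline
  await benchmark(250); // parallel run
})();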

Here’s a glimpse of the debug output as the code runs:

ddb-parallel-scan (96%) [224/250] [time:196ms] [fetched:0] [total (fetched/scanned/table-size):908/34853064/36298819] +3ms
ddb-parallel-scan (96%) [145/250] [time:216ms] [fetched:0] [total (fetched/scanned/table-size):920/34854754/36298819] +8ms
ddb-parallel-scan (96%) [210/250] [time:232ms] [fetched:0] [total (fetched/scanned/table-size):920/34856457/36298819] +5ms
ddb-parallel-scan (96%) [211/250] [time:223ms] [fetched:0] [total (fetched/scanned/table-size):920/34858126/36298819] +3ms

You can see:

  • Which segments are currently being scanned (145-224 in this snippet)
  • How long each request takes (~200 ms on average)
  • How many items matching the filter criteria have been found so far (920)
  • The overall scan progress (roughly 34.9 million out of 36.3 million items scanned)

Let’s compare the results!

So in my example, running the script on a t4g.small EC2 instance, I was able to scan a 23 GB table in just one minute, whereas the sequential scan took 49 minutes. 50x faster, not bad!

The only time you won’t see a performance improvement is when your hash key distribution is not uniform enough. If you have a table where 100% of the items share the same hash key, DynamoDB will not be able to split your table into segments, and your code will end up scanning one segment at a time.

So as a rule of thumb, don’t set the concurrency higher than the number of unique hash key values in your table. The illustration below shows a good (left) and a bad (right) hash key design.

Uniformly unique hash keys on the left, unbalanced hash keys on the right. Parallel scanning works well for the table on the left and has no positive effect on the second table.

While the scan was running, the table’s consumption peaked at ~43,600 read capacity units.

A single scan of such a table costs us $2.88. Here is the AWS Cost Calculator link to the saved estimate.

Note that my table has on-demand capacity mode enabled. This means AWS temporarily increases the table’s read capacity as needed for a one-off scan operation. If you’re on provisioned capacity mode, you might want to fine-tune the concurrency parameter to avoid throttling.

The DynamoDB Scan API provides a way to scan a table with concurrent requests. Using 100-200 segments in a single process will give you at least a 50x improvement in scanning speed.

If you want to go beyond this, make sure your machine has enough network bandwidth to handle so many concurrent requests. If not – consider splitting the scan operation across multiple machines.

For convenience, you can use the dynamodb-parallel-scan Node.js library, or read its implementation for inspiration if you want to implement this in other programming languages.

If you liked this article, consider following me to read more about the lessons learned (and horror stories) around DynamoDB, AWS and Serverless.
