DynamoDB is not like other databases. It trades convenience and accessibility for speed and reliability at any scale. Where a classic database might offer JOIN, UNION, IN, and GROUP BY interfaces, DynamoDB has none of these. This allows it to scale infinitely - a table with a billion items will be just as fast as one with a dozen items - as long as you design your data structures and access patterns well.
UPDATE 2021-04-29 This article got a lot of valid criticism on social media. The main critique was that the issues I’m “solving” in this article are caused by bad partition key design. To address these issues I’ve rewritten the post to make a clearer point. You can find it here: Simplify data updates in DynamoDB with DynamoDB Streams, and retain your sanity. The original post will remain online for posterity, but to learn about data duplication in DynamoDB I’d suggest reading the other post instead.
There are two main concepts in data design for DynamoDB. The first is the partition and sort key - which we will not cover in this article - and the other is data duplication. In this post we will cover what data duplication is, how it affects your table design, and the trade-offs it has for your application.
Let’s start with the basics: duplicating data means storing the same data multiple times so it can be retrieved quickly through various access patterns. This is the exact opposite of the deduplication we’re used to in relational databases. Let me clarify the difference with an example.
Let’s say we’re modeling a database for a blog. In relational databases we would have two tables: Author and BlogPost. The Author table looks like this:
ID | Name | Role | Bio |
---|---|---|---|
1 | Luc | Managing Editor | Luc is the managing editor of the blog |
2 | Jane | Science Editor | Jane is the science editor of the blog |
The BlogPost table looks like this:
ID | Title | AuthorID | Excerpt | Body |
---|---|---|---|---|
1 | My first article | 1 | A short text | A very long, descriptive text. |
2 | A science article | 2 | A short science text | A very long, interesting science text. |
The BlogPost table has a reference to the Author by ID. We could query the BlogPosts and their authors in a single query with a JOIN:
SELECT
BP.ID,
BP.Title,
BP.Excerpt,
A.Name,
A.Role
FROM BlogPost BP
LEFT JOIN Author A ON BP.AuthorID = A.ID
This would yield the following result:
ID | Title | Excerpt | Name | Role |
---|---|---|---|---|
1 | My first article | A short text | Luc | Managing Editor |
2 | A science article | A short science text | Jane | Science Editor |
The core principle in this relational design is deduplication. Apart from the identifiers, all data is only stored once. If we would like to update an author’s role, we only have to update it in the Author table. Since our query fetches the data directly from the Author table, the updated role will immediately be available.
In DynamoDB the data structure would be very different. Contrary to relational databases, the leading design principle is performance, even if that means storing data multiple times. To determine the best design for performance, we first need to define our access patterns. Let’s take a look at the wireframes for our blog.
Looking at our wireframes, the access patterns are:
- Fetch all authors
- Fetch all blog posts and their authors
- Fetch all blog posts by a specific author
- Fetch a single blog with its author
In DynamoDB single table design (let’s not abbreviate that), the blog database would look like this:
PartitionKey | SortKey | Name | Role | Bio | Title | Excerpt | Body | Author |
---|---|---|---|---|---|---|---|---|
Author | Luc | Luc | Managing Editor | Luc is the managing editor of the blog | | | | |
Author | Jane | Jane | Science Editor | Jane is the science editor of the blog | | | | |
BlogPost | Luc#<Date>#My first article | | | | My first article | A short text | A very long, descriptive text. | **{'Name': 'Luc', 'Role': 'Managing Editor'}** |
BlogPost | Jane#<Date>#A science article | | | | A science article | A short science text | A very long, interesting science text. | **{'Name': 'Jane', 'Role': 'Science Editor'}** |
Note the data duplication in bold. A subset of the author information is stored with the blog posts so we only need to perform a single query to fetch blog posts and their authors.
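To make this layout concrete, here is a minimal sketch of how the first blog post item could be written with boto3. The table name ('Blog') and the exact key attribute names are assumptions based on the layout above:

```python
import boto3
from boto3.dynamodb.conditions import Key  # used by the query examples below

# Assumed setup: a table named 'Blog' with PartitionKey as its partition key
# and SortKey as its sort key, matching the layout above.
BlogTable = boto3.resource('dynamodb').Table('Blog')

# Write the first blog post item.
BlogTable.put_item(Item={
    'PartitionKey': 'BlogPost',
    'SortKey': 'Luc#<Date>#My first article',
    'Title': 'My first article',
    'Excerpt': 'A short text',
    'Body': 'A very long, descriptive text.',
    # Duplicated author attributes: one query returns the post and its author.
    'Author': {'Name': 'Luc', 'Role': 'Managing Editor'},
})
```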
In code, the access patterns we defined above look like this:
Fetch all authors:
BlogTable.query(KeyConditionExpression=Key('PartitionKey').eq('Author'))
Fetch all blog posts and their authors:
BlogTable.query(KeyConditionExpression=Key('PartitionKey').eq('BlogPost'))
Fetch all blog posts by a specific author:
BlogTable.query(KeyConditionExpression=Key('PartitionKey').eq('BlogPost') & Key('SortKey').begins_with('Luc#'))
Fetch a single blog with its author:
BlogTable.query(KeyConditionExpression=Key('PartitionKey').eq('BlogPost') & Key('SortKey').eq('Luc#<Date>#My first article'))
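These snippets gloss over response handling. In practice each query returns a page of results that you read from the response's 'Items' key, roughly like this:

```python
# A query returns at most 1 MB of data per call; for larger result sets you
# would follow LastEvaluatedKey to fetch the remaining pages.
response = BlogTable.query(
    KeyConditionExpression=Key('PartitionKey').eq('BlogPost')
    & Key('SortKey').begins_with('Luc#')
)
posts = response['Items']
```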
In this article we will only look at attribute-level duplication, where another item (or part of an item) is stored as the attribute of a related item - like the author stored with the blog post. This works well when the amount of duplicated data is limited: authors of a post, related blog posts, or items in a shopping cart. If the amount of duplicated data can grow beyond a limited set, for example likes of a post, the subscribers of a newsletter or blog posts in a specific category, attribute-level duplication will hit the 400KB-per-item limit. In these cases you will need to look at item-level duplication: storing an item multiple times under different sort keys to allow for fast queries through different access patterns. We will not cover item-level duplication in this post. The principles discussed in this article apply to both attribute-level duplication and item-level duplication.
Updating data (an easy introduction)
The example above shows that designing data for DynamoDB requires a completely different mindset than for relational databases. And it’s not necessarily a human-friendly or easy mindset. As Forrest Brazeal wrote:
In fact, a well-optimized single-table DynamoDB layout looks more like machine code than a simple spreadsheet — despite all the bespoke, human finagling it took to create it.
But data design might actually only be the first, and not the worst, of your issues. That distinction goes to the application code that needs to update the data in your table.
Because data is duplicated to multiple places, your application will need to keep track of these places and update all of them. Let’s say Jane has received a promotion and is now the Principal Science Editor. The pseudo code for the first update would look like this:
BlogTable.update(
Key=Key('PartitionKey').eq('Author') & Key('SortKey').eq('Jane'),
UpdateAttributes=(Attribute('Role') = 'Principal Science Editor')
)
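In real boto3 code the same update could look roughly like this; the UpdateAttributes shorthand above maps to an UpdateExpression:

```python
BlogTable.update_item(
    Key={'PartitionKey': 'Author', 'SortKey': 'Jane'},
    UpdateExpression='SET #role = :role',
    # A name placeholder avoids clashes with DynamoDB's reserved words.
    ExpressionAttributeNames={'#role': 'Role'},
    ExpressionAttributeValues={':role': 'Principal Science Editor'},
)
```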
However, if a user requests the blog posts and their authors, they would still see Jane’s old role. Therefore the application also needs to update the authors stored with the blog posts:
BlogTable.update(
Key=Key('PartitionKey').eq('BlogPost') & Key('SortKey').begins_with('Jane#'),
UpdateAttributes=Attribute('Author.Role') = 'Principal Science Editor'
)
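Note that a single update call cannot target a begins_with() condition, so in practice this second step is a query followed by one update per blog post. A sketch, assuming the key design above:

```python
# Find all of Jane's posts (pagination omitted for brevity) ...
posts = BlogTable.query(
    KeyConditionExpression=Key('PartitionKey').eq('BlogPost')
    & Key('SortKey').begins_with('Jane#')
)['Items']

# ... and update the duplicated author data on each of them.
for post in posts:
    BlogTable.update_item(
        Key={'PartitionKey': 'BlogPost', 'SortKey': post['SortKey']},
        UpdateExpression='SET Author.#role = :role',
        ExpressionAttributeNames={'#role': 'Role'},
        ExpressionAttributeValues={':role': 'Principal Science Editor'},
    )
```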
This example of data duplication might seem of limited benefit. After all, retrieving the author of a blog post would have required only one additional query. Yet two queries take twice as much time and compute capacity as one, and in the more advanced examples below we’ll see how not duplicating data quickly multiplies the number of queries.
Updating data (an intermediate example with many-to-many relationships)
In the introduction above we’ve seen that the responsibility for data structures and data management has shifted from the database to the application. We can no longer rely on the database to manage it for us like we can in relational systems, so we have to add more code to update authors. But the previous example was a simple one. Let’s make it a bit more complex by adding an additional access pattern. Our website has been updated, and now displays ‘related posts’ next to an article.
The new access pattern is “Fetch a single blog with its author and its related posts”. We still want our website to be as fast as possible, so we store a trimmed-down version of the related blog posts with the main blog. In our single table design this is stored as follows (we’re leaving the authors out for now; duplicated data is again shown in bold):
PartitionKey | SortKey | Title | Excerpt | Body | RelatedPosts |
---|---|---|---|---|---|
BlogPost | Luc#<Date>#My first article | My first article | A short text | A very long, descriptive text. | |
BlogPost | Luc#<Date>#My second article | My second article | Another short text | A very long, descriptive text about a second topic. | **{'Luc#<Date>#My first article': {'Title': 'My first article', 'Excerpt': 'A short text'}}** |
BlogPost | Jane#<Date>#A science article | A science article | A short science text | A very long, interesting science text. | |
This example provides a clear argument for data duplication. In theory we could have stored only the keys of the related posts, and then queried the title and excerpt of every post separately. But if there are 50 related posts, that translates to 51 DynamoDB queries, while with data duplication we’ve limited the number of requests to a single query.
With this data structure we can query the second blog post and immediately have the data we need to render the related posts. But there is a problem. If we choose to update the first post, we have no way to determine which other posts display it as a related post. In other words, we don’t know on which other posts we need to perform an update. We could do a scan, but you really want to avoid that in large databases. So instead, we store reverse references with every post. That looks like this:
PartitionKey | SortKey | Title | Excerpt | Body | RelatedPosts | ReferencedBy |
---|---|---|---|---|---|---|
BlogPost | Luc#<Date>#My first article | My first article | A short text | A very long, descriptive text. | | {'Luc#<Date>#My second article': true} |
BlogPost | Luc#<Date>#My second article | My second article | Another short text | A very long, descriptive text about a second topic. | {'Luc#<Date>#My first article': {'Title': 'My first article', 'Excerpt': 'A short text'}} | |
BlogPost | Jane#<Date>#A science article | A science article | A short science text | A very long, interesting science text. | | |
You might wonder why the reference is stored as a dictionary (`{'Luc#<Date>#My second article': true}`) and not a list (`['Luc#<Date>#My second article']`). The reason is simple: in DynamoDB it’s way easier to remove a key from a dictionary than to remove a value from a list.
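To illustrate: removing a map entry only requires its key, while removing a list element requires knowing (and first looking up) its current index. A sketch of the map variant, using the assumed setup from earlier:

```python
# Remove the reference to the second article from the first article's
# ReferencedBy map: the map key is all we need to know.
BlogTable.update_item(
    Key={'PartitionKey': 'BlogPost', 'SortKey': 'Luc#<Date>#My first article'},
    UpdateExpression='REMOVE ReferencedBy.#ref',
    ExpressionAttributeNames={'#ref': 'Luc#<Date>#My second article'},
)
# With a list we would first have to read the item, find the element's index,
# and then issue something like REMOVE ReferencedBy[3].
```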
With the references stored in the first item, updating the title or excerpt requires first updating the main article (`Luc#<Date>#My first article`), and then looping over every item in its ReferencedBy dictionary to update them too:
BlogTable.update(
Key=Key('PartitionKey').eq('BlogPost') & Key('SortKey').eq('Luc#<Date>#My second article'),
UpdateAttributes=Attribute('RelatedPosts.Luc#<Date>#My first article') = {'Title': 'An updated title', 'Excerpt': 'An updated excerpt'}
)
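In real boto3 code the full loop could look roughly like this. Note that the map keys contain '#', so they have to be passed through expression attribute name placeholders:

```python
main_key = 'Luc#<Date>#My first article'

# Fetch the main post to find out which posts reference it.
main_post = BlogTable.get_item(
    Key={'PartitionKey': 'BlogPost', 'SortKey': main_key}
)['Item']

# Update the duplicated copy stored under every referencing post.
for referencing_key in main_post.get('ReferencedBy', {}):
    BlogTable.update_item(
        Key={'PartitionKey': 'BlogPost', 'SortKey': referencing_key},
        UpdateExpression='SET RelatedPosts.#ref = :copy',
        ExpressionAttributeNames={'#ref': main_key},
        ExpressionAttributeValues={
            ':copy': {'Title': 'An updated title', 'Excerpt': 'An updated excerpt'}
        },
    )
```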
Of course, adding new related posts also requires updates on both the post related from and the post related to. And deleting posts requires removing the post from all referencing posts first, and only deleting the main post after.
Updating data (advanced patterns with many-to-many relationships and nested data)
We’ve now seen that many-to-many relationships significantly complicate your DynamoDB table designs and application logic. But still, it can get worse. Let’s update our previous wireframe and add the author name to the related posts too.
To allow this data to be fetched in a single query, the data in the table needs to be updated to this:
PartitionKey | SortKey | Name | Role | Bio | Title | Excerpt | Body | Author | RelatedPosts | ReferencedBy |
---|---|---|---|---|---|---|---|---|---|---|
Author | Luc | Luc | Managing Editor | Luc is the managing editor of the blog | | | | | | |
Author | Jane | Jane | Science Editor | Jane is the science editor of the blog | | | | | | |
BlogPost | Luc#<Date>#My first article | | | | My first article | A short text | A very long, descriptive text. | {'Name': 'Luc', 'Role': 'Managing Editor'} | | {'Luc#<Date>#My second article': true} |
BlogPost | Luc#<Date>#My second article | | | | My second article | Another short text | A very long, descriptive text about a second topic. | {'Name': 'Luc', 'Role': 'Managing Editor'} | {'Luc#<Date>#My first article': {'Title': 'My first article', 'Excerpt': 'A short text', 'Author': {'Name': 'Luc', 'Role': 'Managing Editor'}}} | |
BlogPost | Jane#<Date>#A science article | | | | A science article | A short science text | A very long, interesting science text. | {'Name': 'Jane', 'Role': 'Science Editor'} | | |
You can see where this is going: when we update the role of an author, we first need to update the author itself, then update all the blog posts written by that author, then loop over all the blog posts that reference these blog posts, and update the author there as well. This last update looks like this in pseudo code:
BlogTable.update(
Key=Key('PartitionKey').eq('BlogPost') & Key('SortKey').eq('Luc#<Date>#My second article'),
UpdateAttributes=Attribute('RelatedPosts.Luc#<Date>#My first article.Author') = {'Name': 'Luc', 'Role': 'My updated role'}
)
Two approaches to updating related data
There are two ways to approach the problem of data management in a complex DynamoDB table. The first is to put all the responsibility in the business logic. Again, in pseudo code:
*Update Role for Author 'Luc'*
update role where PK = Author and SK = Luc
update author.role where PK = BlogPost and SK begins_with Luc#
select referenced_posts from BlogPosts where SK begins_with Luc#
for each referenced_post in referenced_posts
update relatedPost.mainPostId.author.role
Technically this is definitely possible. But it will be hard to read, write and maintain, especially when new fields or models are added. Mutating data by hand, directly in the database, also becomes very hard: you would have to manually change every occurrence of the value you want to change. And that brings us to the second approach: event-driven updates with DynamoDB Streams.
This solution allows you to keep your Update Author function simple. It just updates the author and doesn’t worry about propagating that change. The DynamoDB table is configured with DynamoDB Streams, which forwards every single change to a separate Lambda function. This function detects a change in an author and loops over every blog post that needs to be updated. The same function, or set of functions, can also be responsible for propagating deletions. This simulates some of the concepts of a relational database (a single source of truth for data), while still delivering the performance benefits of data duplication. It also re-enables manual mutations: you could change the author directly in the table, and the updater function would be called the same way it would with an API-based change. However, the asynchronous design introduces its own complexities in testing, observability and bug tracing.
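As an illustration of this second approach, here is a minimal sketch of such a stream-handling Lambda function. The table name and the assumption that the stream is configured to include new images are illustration choices, and the nested copies in RelatedPosts would be propagated the same way:

```python
import boto3
from boto3.dynamodb.conditions import Key

BlogTable = boto3.resource('dynamodb').Table('Blog')  # assumed table name

def handler(event, context):
    for record in event['Records']:
        keys = record['dynamodb']['Keys']
        # Only react to changes on Author items.
        if record['eventName'] != 'MODIFY' or keys['PartitionKey']['S'] != 'Author':
            continue
        author = keys['SortKey']['S']
        new_role = record['dynamodb']['NewImage']['Role']['S']
        # Propagate the new role to every post written by this author
        # (pagination omitted for brevity).
        posts = BlogTable.query(
            KeyConditionExpression=Key('PartitionKey').eq('BlogPost')
            & Key('SortKey').begins_with(f'{author}#')
        )['Items']
        for post in posts:
            BlogTable.update_item(
                Key={'PartitionKey': 'BlogPost', 'SortKey': post['SortKey']},
                UpdateExpression='SET Author.#role = :role',
                ExpressionAttributeNames={'#role': 'Role'},
                ExpressionAttributeValues={':role': new_role},
            )
```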
Conclusion
DynamoDB can scale to workloads of any size. This is achieved through its partition design, which allows DynamoDB to scale horizontally, to ‘know’ exactly where data is stored, and to retrieve data from many partitions in parallel. Application and database designers need to structure their data to make optimal use of this architecture. This is primarily achieved through the partition key and sort key, and secondarily through duplication of data.
A well-thought-out duplication scheme can significantly reduce the number of queries, which in turn reduces cost and increases performance. But it introduces other costs: mental overhead, development time and hard-to-pin-down bugs. These are the trade-offs you must be willing to accept to get the most out of DynamoDB’s extreme performance at scale. To design an infinitely scaling data structure you need to let go of easy-to-read relational schemas and start thinking like a machine.
I share posts like these and smaller news articles on Twitter, follow me there for regular updates! If you have questions or remarks, or would just like to get in touch, you can also find me on LinkedIn.