Optimize the Django ORM 🚀
This article incorporates some updates from daredevil82 in this reddit comment.
Recently, I have been optimizing some functions that were slower than expected. As with most MVPs, the initial iteration was to get something working and out there. Looking at Scout APM revealed that some of the database queries were slow, including several n+1 queries. The n+1 queries happened because I was looping over a set of models, and either updating or selecting the same thing for each model. My goal was to reduce any duplicate queries and squeeze out as much performance as I could by refactoring the naive, straightforward operations into more performant equivalents.
In all honesty, the code is slightly more complicated to read through now, but I cut the time for my use case in half without changing anything else about the server or database.
Use the ORM, Luke
One of Django's main benefits is the built-in models and object-relational mapper (ORM). It provides a quick-to-use, common interface for data operations on your models and can handle most queries pretty easily. It can also do some tricky SQL once you understand the syntax.
It's easy to get building quickly. It’s also easy to end up making more (costly) SQL calls than you realize.
Hasta la vista, models
Here are some sample models that will be used to illustrate some of the concepts below.
# models.py
from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=50)

class Book(models.Model):
    author = models.ForeignKey(Author, related_name="books", on_delete=models.PROTECT)
    title = models.CharField(max_length=255)
    is_read = models.BooleanField(default=False)
Show me the sql (part 1)
Because the SQL calls are abstracted behind a simple API, it's easy to end up making more SQL calls than you realize. You can retrieve a close approximation with the query attribute on a QuerySet, but heed the warning about it being an "opaque representation".
books = Book.objects.all()
print("books.query", books.query)
Show me the sql (part 2)
You can also add django.db.backends to your configured loggers to see the generated SQL printed out to the console.
"loggers": {
"django.db.backends": {
"level": "DEBUG",
"handlers': ["console", ],
}
}
Show me the sql (part 3)
You can also print out the time and generated SQL that Django stores on the database connection.
from django.db import connection

# evaluate the lazy queryset so its SQL actually executes
books = list(Book.objects.all())

# connection.queries is only populated when DEBUG is True
print("connection.queries", connection.queries)
Explain me the sql
Django also has explain(), "which details how the database would execute the query, including any indexes or joins that would be used". I've only used this with PostgreSQL, but it is supported by all of Django's built-in database backends except Oracle.
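For example, printing the plan for a filtered queryset looks like this (analyze=True is a PostgreSQL option that actually runs the query and reports real timings):

# print the database's estimated query plan for this queryset
print(Book.objects.filter(is_read=True).explain())

# on PostgreSQL, analyze=True executes the query and includes actual timings
print(Book.objects.filter(is_read=True).explain(analyze=True))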
explain() can give you detailed insight into performance problems, but the query plan can be a little obtuse to read. Luckily, explain.depesz.com can make the plan more readable. For really gnarly problems, pgMustard allows 5 free EXPLAIN analyses, and the paid plan is very reasonable for a business that prioritizes database performance, but can't afford a specialized DBA.
The one Toolbar to rule them all
If your code is called from a view, the easiest way to start deciphering what SQL is generated is installing Django Debug Toolbar. DDT provides an unbelievably helpful diagnostic tool which shows all of the SQL queries being run, how many are similar to each other, and how many are duplicated. You can also look at the query plan (similar to the output from explain()) for each SQL query and dig into why it might be slow.
Silky smooth profiling
If you're using Django as an API (e.g. using Django REST Framework), Django Debug Toolbar won't be as useful because it requires a template to show its debugging panels. django-silk can be used in those instances to get some of the same generated SQL for each of your endpoints.
Select and prefetch all the relateds
One thing to realize is that Django's ORM is "lazy" by default. It will not run queries until the result has been asked for (either in code or directly in a view). It also won't join models by their ForeignKeys until needed. Those are beneficial optimizations; however, they can bite you if you don't realize it.
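For example, here is a quick sketch of when the SQL actually runs:

books = Book.objects.filter(is_read=True)  # no SQL has run yet
first_book = books[0]  # evaluating the queryset runs the SQL here
title = first_book.title  # no extra query; the field was already loaded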
# views.py
def index(request):
    books = Book.objects.all()
    return render(request, "index.html", {"books": books})
<!-- index.html -->
{% for book in books %}
    Book Author: {{ book.author.name }}<br />
{% endfor %}
In the code above, each book in the for loop in index.html will call the database again for the author's name. So, there would be 1 database call to retrieve the set of all books, and then an additional database call for every book in the list.
The way to prevent the extra database calls is to use select_related to force Django to join the other model once and prevent subsequent calls if that relation is used. Updating the view code to use select_related would reduce the total SQL calls to only 1 for the same Django template.
# views.py
def index(request):
    books = Book.objects.select_related("author").all()
    return render(request, "index.html", {"books": books})
In some cases select_related won't work, but prefetch_related will. The Django documentation has lots more details about when to use prefetch_related.
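For example, following the reverse relation from an author to their books can't use a SQL join through select_related, but prefetch_related handles it with a second query whose results get stitched together in Python:

# 2 queries total: one for the authors, one for all of their books
authors = Author.objects.prefetch_related("books")

for author in authors:
    # no additional queries inside the loop
    print([book.title for book in author.books.all()])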
Beware the instantiating of models
When the Django ORM creates a QuerySet, it takes the data retrieved from the database and populates the models. However, if you don't need a model, there are a few ways to skip constructing them unnecessarily.
values_list will return a QuerySet of tuples for all of the columns specified. Particularly useful is the flat=True keyword argument, which returns a flattened list if only one field is specified.
# get a list of book ids to use later
book_ids = Book.objects.all().values_list("id", flat=True)
You can also create a dictionary with the pair of data that might be required later with values. For example, if I was going to need book ids and their titles:
# get a dictionary of book id->title
book_ids_to_titles = {b["id"]: b["title"] for b in Book.objects.values("id", "title")}
To get all of the book ids: book_ids_to_titles.keys(). To get all titles: book_ids_to_titles.values().
Somewhat related, bidict is fantastic for an easy way to retrieve a dictionary's key from its value and vice versa (as opposed to keeping around 2 dictionaries).
from bidict import bidict

book_ids_to_titles = bidict({
    "1": "The Sandman",
    "2": "Good Omens",
    "3": "Coraline",
})

assert book_ids_to_titles["1"] == "The Sandman"
assert book_ids_to_titles.inv["The Sandman"] == "1"
Filtering on ids makes the world go 'round
Using filter translates to a WHERE clause in SQL, and searching for an integer will almost always be faster than searching on a string in Postgres. So, Book.objects.filter(id__in=book_ids) will be slightly more performant than Book.objects.filter(title__in=book_titles).
Only and defer to your heart's content
only and defer are mirror-opposite methods to achieve the same goal of only retrieving particular fields for your model. only works by SELECTing the specified database fields, but not filling in any non-specified fields. defer works the opposite way, so the specified fields will not be included in the SELECT statement.
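A minimal sketch of both approaches:

# SELECTs only "title" (plus the primary key); accessing a deferred
# field like book.is_read later triggers an extra query per object
books = Book.objects.only("title")

# SELECTs everything except "is_read"
books = Book.objects.defer("is_read")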
However, this note in the Django documentation is telling:
They provide an optimization for when you have analyzed your queries closely and understand exactly what information you need and have measured that the difference between returning the fields you need and the full set of fields for the model will be significant.
Annotate and carry on
For some code, I was getting a count for each model in a list by querying inside a loop.
for author in Author.objects.all():
    book_count = author.books.count()
    print(f"{book_count} books by {author.name}")
This will create one SQL SELECT statement for every author. Instead, using an annotation will create one SQL query.
from django.db.models import Count

author_counts = (
    Author.objects
    .annotate(book_count=Count("books"))
    .values("name", "book_count")
)

for obj in author_counts:
    print(f"{obj.get('book_count')} books by {obj.get('name')}")
Aggregation is the simpler version of annotation if you want to calculate a value for all objects in a list (e.g. get the maximum id from a list of models). Annotation is useful if you want to calculate values over each model in a list and get the output.
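For example, getting the maximum id from all books is a single aggregate query:

from django.db.models import Max

# returns a dictionary like {"id__max": 42}
max_book_id = Book.objects.aggregate(Max("id"))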
Bulk smash! Errr, create
Creating multiple objects with one query is possible with bulk_create. There are some caveats to using it, and depending on your database backend you may not get the ids of the created objects back after the insert (PostgreSQL does set the primary keys on the returned objects). But, for simple use cases it works great.
author = Author(name="Neil Gaiman")
author.save()
Book.objects.bulk_create([
Book(title="Neverwhere", author=author),
Book(title="The Graveyard Book", author=author),
Book(title="The Ocean at the End of Lane", author=author),
])
We want to bulk you up
update is a method on QuerySet, so you are able to retrieve a set of objects and update a field on all of them with one SQL query. update can only be used when the field should be updated to the same value for all models.
neil_gaiman_books = Book.objects.filter(author__name="Neil Gaiman")
neil_gaiman_books.update(is_read=True)
However, to efficiently update a queryset of models with different field values, bulk_update can be used.
# evaluate the queryset into a list so the same instances get mutated
neil_gaiman_books = list(Book.objects.filter(author__name="Neil Gaiman"))
neil_gaiman_books[0].is_read = True
neil_gaiman_books[1].is_read = False
neil_gaiman_books[2].is_read = True

# bulk_update needs the objects and the list of fields to update
Book.objects.bulk_update(neil_gaiman_books, ["is_read"])
bulk_update comes with a few caveats, namely that you cannot update the model's primary key and that the model's pre_save and post_save signals will not fire when it gets updated.
Gonna make you sweat (everybody Raw Sql now)
If you really can't figure out a way to get the Django ORM to generate performant SQL, raw SQL is always available, although it's not generally advised to use it unless you have to.
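As a sketch, raw() maps the query results back onto model instances. The table name below assumes Django's default naming for a hypothetical app called "library"; adjust it to match your schema:

# default table names are <app_label>_<model_name>
books = Book.objects.raw(
    "SELECT id, title FROM library_book WHERE is_read = %s", [True]
)

for book in books:
    print(book.title)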
Automatic for the people
django-auto-prefetch will automatically prefetch foreign key and one-to-one related models. It's a great way to create more performant queries just by inheriting from a different base Model class. Highly recommended to save yourself from manually trying to figure out which fields are needed in select_related or prefetch_related method calls.
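A minimal sketch based on the package's README (check its documentation for the currently required setup); the idea is swapping in its base classes:

import auto_prefetch
from django.db import models

class Book(auto_prefetch.Model):
    author = auto_prefetch.ForeignKey(
        "Author", related_name="books", on_delete=models.PROTECT
    )
    title = models.CharField(max_length=255)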
Indices make the world go round
If you are still running into slowness, you can start to investigate database indexes. Going into detail about indexes is out of scope for this article, but some clues can be deciphered by using the output from explain(). Django has a robust infrastructure for specifying indexes that ensures they are part of your database migrations. For more information, Indexing in Postgres and Postgres Indexes for Newbies are two solid articles that give an overview of indexing.
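For example, declaring an index in the model's Meta gets picked up by makemigrations:

class Book(models.Model):
    author = models.ForeignKey(Author, related_name="books", on_delete=models.PROTECT)
    title = models.CharField(max_length=255)
    is_read = models.BooleanField(default=False)

    class Meta:
        indexes = [
            models.Index(fields=["title"]),
        ]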
Putting on the ritz
The Django documentation is generally really helpful and will give you more in-depth details about each technique above. If you know of any other approaches to squeezing the most performance out of Django, I would love to hear about them at twitter.com/adamghill.
Hi, I'm Adam 👋
I've been a backend programmer for ~20 years in a variety of different languages before I discovered Python 10 years ago and never looked back.
alldjango includes all the hard-won experience I've gained over the years building production-scale Django websites.
Feel free to reach out to me on Mastodon or make a GitHub Issue with questions, comments, or bitter invectives.
All code is licensed as MIT.