High Performance MySQL [Ch.6 | Part 1: Query Performance Optimization]

So the last article discussed chapter 5 of High Performance MySQL titled Indexing for High Performance. Today's article discusses chapter 6 titled Query Performance Optimization.

In the last two chapters we discussed schema design optimization and indexing, which are necessary for good performance but not sufficient on their own: you still need to write well-optimized queries, otherwise even the best-designed schema and indexes won't make queries perform as well as expected.

This chapter starts with some general considerations for query performance: a discussion of slow queries and the things to check first when a query is not performing well. It then moves on to query execution basics, including the MySQL query execution engine and optimizer, and finally it discusses how to optimize some specific types of queries.

This chapter, however, is quite long and full of details, so to keep this article short I will split it over two articles; this is part 1.

Slow Queries

A query is broken up into a series of tasks that have to be performed in order to execute it. So if we want to optimize a query, i.e. minimize its response time, we have to understand which tasks it's performing and either eliminate some of them, make them happen less often, or make them execute faster.

In chapter 3, we discussed ways to profile a query and get a better idea of which tasks it's performing. You can understand what a query is spending time on by following the sequence of stages you get from profiling it: the query is parsed, planned, executed, and the result set is returned to the client. Execution is the most important and usually the most time-consuming stage because it involves many calls to the storage engine to retrieve rows, and it also includes post-retrieval operations such as sorting and grouping.

On top of the work needed during the execution phase, there is also the incurred cost of network calls, CPU work, I/O accesses, waiting for locks, and so on.

Slow Queries: Optimizing Data Access

One of the top reasons for degraded query performance is working with too much data, because it translates into more calls to the storage engine to retrieve rows, with all the cost those calls incur. A query can be accessing more data than it needs because either:

  1. The application is fetching more data than it actually needs.
  2. MySQL server is fetching more data than it actually needs.

So let's analyze each of those two possibilities.

Are You Asking The Database For Data You Don't Need?

This happens when the application layer requests a lot of data and then throws some of it away because it doesn't actually need it, which adds a lot of overhead on the MySQL server. Here are some of the common mistakes:

  • Fetching more rows than you need: It's a common misunderstanding that MySQL provides results on demand; in truth, MySQL computes the whole result set and returns it to the client. So if you write SELECTs over all rows and then fetch only the first \(N\) rows in the application layer, MySQL still does the work for the entire result set even though you need just a subset of it. The obvious solution here is to use LIMIT in your SELECT statements if you only need a few rows (see the sketch after this list).
  • Fetching all columns from a multitable join: This forces MySQL to read the full length of every row used in the join, which results in reading far more data than actually needed. For example, instead of using SELECT * in a multitable join, replace it with SELECT col1, col2, ... to select only the columns you need.
  • Fetching all columns: This is similar to the previous point, but even for queries that are not joins, fetching all columns results in more I/O and, most importantly, defeats the purpose of having covering indexes. It's not as bad as in a multitable join and might be reasonable sometimes if it simplifies development or if the application has a caching layer and the other columns are used later on.
  • Fetching the same data repeatedly: This is a very common mistake. Say you are populating an email template given a user id: you might see code that queries the users table to fetch the username, queries it again to fetch the email, queries it yet again to fetch the profile picture URL, and so on. It's always better to fetch everything you need at once rather than hitting the database more often than you have to.
  • N+1 problem: This is actually not mentioned in the book, but I have seen it a lot and it causes quite the performance degradation. In many databases we have relations between tables; consider a database with a users table and an emails table, where a user can have many emails. In many ORMs, you can do user.emails to fetch the emails associated with that user, which executes a single query. But say you want to fetch all users and, for each user, all of their emails. Many ORMs will lazy load this, which results in 1 query to fetch all users that fans out into \(N\) queries (one per user) to fetch their emails. To understand this better, if we have 3 users with ids \([1, 2, 3]\), it will fire queries that look like:
    SELECT * FROM emails WHERE user_id = 1;
    SELECT * FROM emails WHERE user_id = 2;
    SELECT * FROM emails WHERE user_id = 3;
    
    This can be optimized by eagerly loading the association which converts the 3 queries to fetch the emails into 1 query that looks like so:
    SELECT * FROM emails WHERE user_id IN (1, 2, 3);
    
    effectively reducing \(N+1\) queries into just 2 queries.
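
Going back to the first couple of points above, here is a minimal sketch of the fix; the posts table and its columns are hypothetical, just for illustration:

-- Fetch only the columns and the 10 rows we actually need,
-- instead of SELECT * over the whole table and discarding
-- most of it in the application layer.
SELECT id, title
FROM posts
ORDER BY created_at DESC
LIMIT 10;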

Is MySQL Examining Too Much Data?

Once we are sure that the application code is not asking for more data than it actually needs, we can start investigating if MySQL is examining too much data to generate a result set. The three main metrics used to estimate query cost are:

  1. Response time.
  2. Number of rows examined.
  3. Number of rows returned.

None of these metrics is perfect for measuring query cost, but together they give a reasonably good idea of how much data MySQL has to access internally and an approximation of how fast the query runs. All three metrics are logged in the slow query log, so looking at the slow query log is one of the best ways to find queries that examine too much data.
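
As a practical aside (this is not from the book's example), the slow query log can usually be enabled at runtime; slow_query_log and long_query_time are standard MySQL system variables, but check the documentation for your version:

-- Enable the slow query log and log anything slower than 1 second.
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 1;
-- Each slow log entry then records Query_time, Lock_time, Rows_sent and Rows_examined.

Now let's look at each of those metrics: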

Response Time

While response time looks like a good metric for how fast a query is, it shouldn't be taken at face value because, on its own, it doesn't tell you whether the query is doing more work than it needs to. The response time of a query breaks down into service time, which is the time it spends actually doing work, and queue time, which is the time it spends waiting for resources or I/Os to finish.

A major drawback of response time as a metric is that it's heavily affected by transient conditions such as another heavy query running concurrently or a temporarily slow network connection.

All in all, response time might give an initial hint that a query is slow, but on its own it can't be relied upon and has to be supplemented with other metrics to make meaningful decisions.

Rows Examined and Rows Returned

These two metrics show the number of rows MySQL had to examine and the number of rows it eventually returned. Common sense says the number of examined rows should equal the number of returned rows for optimal execution, but this is rarely possible in practice because in many cases MySQL has to examine multiple rows to generate a single row in the result set, for example in joins. So the ratio typically ranges from \(1:1\) to maybe \(10:1\), and it can be orders of magnitude more.

Rows Examined and Access Types

None of the previous metrics alone is a great way to analyze slow queries; row access types, however, are a much better tool for that. So what's an access type? When MySQL needs to access rows, it can access them in different ways, ranging from a full table scan, to an index scan, to a range scan, to a unique index lookup, to a constant access. Each of these is faster than the one before it because it requires reading less data.

You can see the access type MySQL decided to use in the EXPLAIN result, and you can often remedy bad access types by adding the correct indexes or rewriting your queries to make optimal use of the indexes that are already available. Let's take an example to clear this up; consider this query:

SELECT * FROM film_actor WHERE film_id = 1;

To see which access type is used to fetch the film_actor rows, we can run:

EXPLAIN SELECT * FROM film_actor WHERE film_id = 1;

And check the key column: let's say it shows idx_fk_film_id, which means MySQL will use the index on film_id. Let's also check the ref column: let's say it shows const, which means a constant value (the literal 1 in the WHERE clause) is used for the lookup in the idx_fk_film_id index. The access type (the type column) in this case is ref, which is a very efficient access type for this query.
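
For reference, the vertical (\G) output of that EXPLAIN might look roughly like the following; the exact key_len and rows estimates depend on your schema and data:

mysql> EXPLAIN SELECT * FROM film_actor WHERE film_id = 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: film_actor
         type: ref
possible_keys: idx_fk_film_id
          key: idx_fk_film_id
      key_len: 2
          ref: const
         rows: 10
        Extra: NULL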

Now let's drop the idx_fk_film_id index and see how it affects the query.

ALTER TABLE film_actor DROP FOREIGN KEY fk_film_actor_film;
ALTER TABLE film_actor DROP KEY idx_fk_film_id;

Now the EXPLAIN will show NULL in the key column, which means there are no indexes MySQL can use; it will also show NULL in the ref column and ALL in the type column, which means it will do a full table scan. The EXPLAIN result also has a column called rows which shows an estimate of how many rows MySQL is planning to examine. Before dropping the index it was 10, meaning MySQL only needed to examine rows matching the condition, but after dropping the index it shows 5073, which is roughly the total number of rows in the table.
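
After dropping the index, the same EXPLAIN might look roughly like this instead:

mysql> EXPLAIN SELECT * FROM film_actor WHERE film_id = 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: film_actor
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5073
        Extra: Using where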

The EXPLAIN result also has a column called Extra, which now shows Using where, meaning the MySQL server will apply the WHERE clause after fetching rows from the storage engine (remember we said before that the MySQL version in the book had a limitation that the server couldn't push down WHERE conditions to the storage engine). MySQL can apply conditions in one of the following ways:

  • Apply the conditions to the index lookup to eliminate non-matching rows. This happens in the storage engine layer.
  • Use a covering index (shown as Using index in the Extra column) to filter rows using the index alone, without accessing the rows in the table storage. This happens in the server layer but doesn't require reading rows from the table.
  • Retrieve rows from the table and then filter them out (shown as Using where in the Extra column). This happens at the server layer and requires reading rows from the table, so it's the slowest.

So it's important to make sure your queries are being executed with an optimal access type, and if they're not, you can try adding the right indexes or rewriting the queries. You may also have to make schema changes, for example introducing summary tables or materialized views.

Ways to Restructure Queries

As you optimize queries, your goal should be to find alternative ways to get the same result more efficiently. Sometimes you don't even need exactly the same result, if a different or less precise one offers a performance improvement worth trading some accuracy for. In this section we will discuss some of the ways to restructure queries to improve their performance.

Complex Queries Vs. Many Queries

This is a very common design decision: should you have one complex query that does a multitude of tasks, or break it up into \(N\) simpler queries? The traditional advice is to do as much work as possible with as few queries as possible, to avoid hitting the database more than necessary. This approach was historically better because of the cost of network communication and the overhead of the query parsing and optimization stages.

However, this advice doesn't apply as strongly to MySQL, as it's designed to handle many connections, connect and disconnect very efficiently, and respond to small, simple queries very quickly. Modern networks are also significantly faster, so the network cost is not as significant as it used to be, especially within a VPC or the same data center.

There is no rule of thumb that fits all cases here. The idea is to not be afraid to decompose a complex query into simpler queries: weigh the costs of doing so, and it might actually end up being more efficient and easier to develop and maintain.

Chopping Up a Query

Another way to slice up a query is to keep it the same but run it on smaller chunks of data, so that each run affects a smaller number of rows. A perfect use case for this is queries that alter a lot of rows, for example a purge job that deletes old records. The monolithic way to do this is to run a single query:

DELETE FROM messages WHERE created < DATE_SUB(NOW(),INTERVAL 3 MONTH);

This will delete all messages that are at least 3 months old. The problem with this query is that it might run for a long time, which can lock rows for a long time, fill up the transaction log, slow down small queries, introduce replication lag, and overload the server because all the work has to complete at once.

A better approach is to chop this query up so it does the same work but operates on smaller ranges of the data, for example:

rows_affected = 0

do {
  rows_affected = do_query(
    "DELETE FROM messages WHERE created < DATE_SUB(NOW(),INTERVAL 3 MONTH)
     LIMIT 10000"
  )
} while rows_affected > 0

This limits each DELETE to 10,000 rows. Each of these queries executes much faster, which avoids the problems we mentioned above. The loop keeps firing queries until one of them affects no rows. Another upside of this approach is that you can sleep in between queries to avoid overloading the server, and you can even schedule them as background jobs spread over a day, for example, to avoid spikes.

Join Decomposition

In many cases you might need to write a multitable join that joins a few tables together to generate a result set. Let's consider the following example, which fetches posts tagged with a tag called mysql:

SELECT * FROM tag
JOIN tag_post ON tag_post.tag_id=tag.id
JOIN post ON tag_post.post_id=post.id
WHERE tag.tag='mysql';

Join decomposition is the act of splitting the join operation into simple queries and then performing the join at the application level. This is very similar to a hash join, if you are familiar with that. We can decompose the previous query as follows:

SELECT id FROM tag WHERE tag='mysql';
SELECT post_id FROM tag_post WHERE tag_id=1234;
SELECT * FROM post WHERE post.id IN (123,456,567,9098,8904);

What this does is fetch the id of the tag called mysql, use it to query the tag_post table for the post ids tagged with mysql, and then query the post table for the posts that match those ids.

This might seem wasteful at first glance because one query turned into three, but that's not always the case, as join decomposition can offer the following advantages:

  • It's more cache friendly: the id of the tag will probably never change, so you can cache it, and you can even devise (more complex) caching mechanisms for the post ids associated with a given tag.
  • The three resulting queries are shorter and execute more quickly, hogging resources for less time and reducing lock contention.
  • The queries themselves can be more efficient; for example, using IN lets MySQL sort the ids and fetch the rows in a more optimal order, while a join might need to do random I/O.
  • Performing the join in the application layer fetches each piece of data only once, whereas a join may, by design, fetch the same piece of data multiple times in order to generate rows in the result set.

Note that MySQL normally performs joins using a nested-loop algorithm, while what we did effectively transforms the join into a hash join.

This concludes part 1 of this article. In the next part we will take a look at query execution basics, the MySQL client/server protocol, MySQL's limitations, and a quick guide to optimizing some specific types of queries such as COUNT, JOIN, GROUP BY, subqueries, etc.

References

High Performance MySQL: Optimization, Backups, and Replication

Previous Articles

Check out previous articles in this series here: