High Performance MySQL [Ch.7: Advanced MySQL Features]
In the previous chapter we discussed query performance optimization techniques such as improving data access types and restructuring queries.
In this chapter we will go over some of the advanced features that MySQL has to offer and see how we can leverage them to improve performance even more. Some of the features we will discuss throughout this article include partitioned tables, views, triggers, and events.
This will be the last chapter we discuss in this book (I will explain why at the end of this article).
A partitioned table is a single logical table that’s composed of multiple physical subtables. Each table in MySQL has a Handler that communicates with the storage engine for that table, so the partitioning code is just a wrapper around a set of Handler objects each representing one of the underlying partitions.
The partitioning code is responsible for routing queries (SELECT, INSERT, ...) to the correct partitions, and their Handlers are responsible for forwarding requests to the storage engine layer, effectively making partitioning a black box when seen from the SQL layer (though it can be easily noticed if you look at the file system).
MySQL decides which partition holds each row of data based on the PARTITION BY clause defined for the table. The query optimizer can prune partitions when queries are executed, so queries don't examine all partitions, just the ones that hold the data you are looking for.
Partitioning can be used for any of the following cases:
- When the table is too big to fit in memory but the table has hot rows (usually at the end of the table) that are accessed frequently. This usually works for tables with a lot of historical data.
- Partitioned tables are much easier to maintain, for example, deleting a range of data can easily be modeled as dropping a complete partition that is defined by that range. It's also a lot easier to re-populate, fix or restore a specific partition than a complete table.
- Partitioning a table also allows for physical distribution of data which enables the storage engine to utilize multiple hard drives to store a partitioned table.
- Partitioning a table can also be used to get around some bottlenecks such as per-index mutexes in InnoDB or per-inode locking in the file system.
How Does Partitioning Work?
As we mentioned before, partitioned tables have multiple underlying tables, which are represented by Handler objects. You can’t access the partitions directly. Each partition is managed by the storage engine in the normal fashion (all partitions must use the same storage engine), and any indexes defined over the table are actually implemented as identical indexes over each underlying partition. So from the storage engine’s point of view, the partitions are just tables, and the storage engine doesn’t really know whether a specific table it’s managing is a standalone table or part of a bigger partitioned table.
Now let's take a look at how the most common operations are implemented for a partitioned table:
- SELECT: When a SELECT statement is run against a partitioned table, the optimizer determines whether any of the partitions can be ignored based on the query and the partitioning key, then the statement is forwarded to the target partitions via their Handlers.
- INSERT: When an INSERT statement is run against a partitioned table, the partitioning layer determines which partition should receive the row and forwards the row to that partition only.
- DELETE: When a DELETE statement is run against a partitioned table, the partitioning layer determines which partition contains the row and forwards the deletion request to it.
- UPDATE: When an UPDATE statement is run against a partitioned table, the partitioning layer determines which partition contains the row and forwards the modification to it. It then determines whether the modification changes which partition the row belongs in; if so, it forwards the row to the new partition and deletes it from the old one.
Note that during all the previous operations, all underlying partitions are opened and locked until the correct partition(s) are chosen. Also note that locking doesn't mean the entire partition will be locked; the same locking logic that applies to normal tables applies to partitions (so, for example, in InnoDB it will be row locks, not table locks).
Types of Partitioning
The way you partition data is heavily dependent on the nature of the problem you are trying to solve. There are many partitioning policies we can use for many different reasons, so let's take a look at a few of those policies:
- Partitioning by range: This is probably the most obvious one. It means we partition a table based on ranges of values derived from one of its columns. For example, consider the following partitioned table:
CREATE TABLE sales (
    order_date DATETIME NOT NULL
    -- Other columns omitted
) ENGINE=InnoDB
PARTITION BY RANGE(YEAR(order_date)) (
    PARTITION p_2010 VALUES LESS THAN (2010),
    PARTITION p_2011 VALUES LESS THAN (2011),
    PARTITION p_2012 VALUES LESS THAN (2012),
    PARTITION p_catchall VALUES LESS THAN MAXVALUE
);
This table is partitioned by ranges on a column called order_date, but it only considers the YEAR part of the stored date, splitting the rows into 4 partitions by year range (with p_catchall holding every year beyond the explicit ranges).
- Partitioning by hash: Usually used to randomize the placement of inserted data in order to reduce contention on InnoDB mutexes.
- Partitioning by modulo: Used to create a round-robin effect during insertion, which is usually used to ensure that a certain partition only holds a certain amount of data. For example, partition date-based data by day modulo 7, or simply by day of week, if you want to retain only the most recent days of data.
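As a sketch of how the last two policies can be declared (the table and column names here are made up for illustration), hash partitioning and a day-of-week round-robin might look like this:

```sql
-- Hash partitioning: spread rows across 4 partitions to
-- randomize insertion and reduce mutex contention.
CREATE TABLE metrics (
    id BIGINT NOT NULL,
    recorded_at DATETIME NOT NULL
) ENGINE=InnoDB
PARTITION BY HASH(id) PARTITIONS 4;

-- Round-robin by day of week: keep roughly the last 7 days
-- of data, one partition per weekday.
CREATE TABLE daily_events (
    day DATE NOT NULL
) ENGINE=InnoDB
PARTITION BY LIST(DAYOFWEEK(day)) (
    PARTITION p_sun VALUES IN (1),
    PARTITION p_mon VALUES IN (2),
    PARTITION p_tue VALUES IN (3),
    PARTITION p_wed VALUES IN (4),
    PARTITION p_thu VALUES IN (5),
    PARTITION p_fri VALUES IN (6),
    PARTITION p_sat VALUES IN (7)
);
```

With the second table, expiring an old day's data becomes a cheap metadata operation, e.g. `ALTER TABLE daily_events TRUNCATE PARTITION p_mon;` instead of a large DELETE.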
What Can Go Wrong?
There are a bunch of mistakes that could degrade the performance of a partitioned table and defeat the purpose of partitioning as a whole. Let's take a look at some of those mistakes:
- NULLs can defeat pruning: If the partitioning expression yields NULL, MySQL will insert the row into the lowest partition, but during retrieval it will examine 2 partitions for that row: the partition implied by the conditions in the WHERE clause and the lowest partition, which can degrade performance. It's also advisable (I'm not sure if it's enforced) to partition a table by its primary key (which is not nullable) or at least make sure that the partitioning expression can never yield NULL.
- Mismatched PARTITION BY and index: If we define an index that doesn't match the partitioning clause, queries might not be prunable. For example, if we define an index on column a and partition the table by column b, then, as we know, each partition ends up with its own index, and any lookup on index a will check that index in every partition, which can be costly. To avoid this problem, you should avoid indexing on non-partitioned columns unless your queries also include an expression that can help prune out partitions. A somewhat obscure example of this: suppose a partitioned table ends up being the second table in a join, and the index that's used for the join isn't part of the partitioning clause. Each row in the join will then access and search every partition in the second table.
- Selecting partitions can be costly: As we mentioned above, tables can be partitioned in many different ways, so the performance of different partitioning strategies is not uniform. For example, range partitioning suffers if the number of partitions is large, because each time a row is inserted, MySQL has to scan the list of partition definitions to determine where the row should go. So if you are using this type of partitioning, it's advisable to limit the number of partitions to a reasonable number (100 partitions or so) and to try to use bulk inserts, because MySQL will scan the list of partition definitions only once for the entire bulk.
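One way to contain the NULL problem described above (shown here against the earlier sales table, with the column made nullable purely for illustration) is to dedicate a tiny lowest partition to NULL values, so the real data partitions stay prunable:

```sql
CREATE TABLE sales (
    order_date DATETIME  -- nullable here, for illustration only
) ENGINE=InnoDB
PARTITION BY RANGE(YEAR(order_date)) (
    -- YEAR(NULL) is NULL, and NULL rows land in the lowest
    -- partition, so give them a small dedicated partition
    -- that real queries will never need to scan.
    PARTITION p_nulls VALUES LESS THAN (0),
    PARTITION p_2010 VALUES LESS THAN (2010),
    PARTITION p_2011 VALUES LESS THAN (2011),
    PARTITION p_2012 VALUES LESS THAN (2012),
    PARTITION p_catchall VALUES LESS THAN MAXVALUE
);
```

Queries that filter on real dates then pay only for the extra (empty or near-empty) p_nulls partition instead of rescanning a large lowest data partition.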
How to Use Partitioning to Optimize Queries?
Partitioning introduces new ways to optimize queries as now the biggest opportunity is that the optimizer can use the partitioning function to prune partitions. As you’d expect from a coarse-grained index, pruning lets queries access much less data than they’d otherwise need to (in the best case).
Let's examine the following example. If we run the following query on a partitioned table without using the partitioning key:
EXPLAIN PARTITIONS SELECT * FROM sales \G
we will get an EXPLAIN result whose partitions column lists every partition of the table, meaning no pruning took place.
On the other hand, if we use the partitioning key in the query like so:
EXPLAIN PARTITIONS SELECT * FROM sales_by_day WHERE day > '2011-01-01'\G
we will get an EXPLAIN result whose partitions column lists only the partitions that can contain matching rows.
So as we can see, when we used the partitioning key, the optimizer was able to prune the partitions that can't have this data, for example p_2010, which contains data with dates in 2010.
Partitioning is too complex a topic to cover fully in just part of an article, so I highly recommend checking the MySQL docs for more information.
A view is a virtual table that doesn’t store any data itself. Instead, the data in the table is derived from a SQL query that MySQL runs when you access the view.
For example, if we create the following view:
CREATE VIEW Africa AS
SELECT * FROM Country WHERE Continent = 'Africa'
WITH CHECK OPTION;
Then we execute the following query:
SELECT Code, Name FROM Africa WHERE Name = 'Egypt';
The view is executed as if we had written the full underlying query ourselves. MySQL can execute the view in 2 different ways:
- The server could choose to first run the query in the view, store its result in a temporary table, and then convert the second query to use the temporary table. So it would do something like this:
CREATE TEMPORARY TABLE TMP_Africa_123 AS
SELECT * FROM Country WHERE Continent = 'Africa';
SELECT Code, Name FROM TMP_Africa_123 WHERE Name = 'Egypt';
This is called the TEMPTABLE algorithm, and we can see that it doesn't look very efficient for this case, but in some other cases it will always be used, for example if the view contains a UNION, aggregate functions, or subqueries.
- The other way this view can be executed is by using the MERGE algorithm which, as you might have already guessed, merges the query in the view definition and the query we are running into a single query, then executes it as follows:
SELECT Code, Name FROM Country
WHERE Continent = 'Africa' AND Name = 'Egypt';
which is a lot more efficient in this case.
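MySQL also lets you request an algorithm explicitly when creating a view, via the ALGORITHM clause. A small sketch, reusing the Africa view from above:

```sql
-- Ask MySQL to use the MERGE algorithm for this view.
-- (MySQL may still fall back to TEMPTABLE if the view
-- contains UNION, aggregates, subqueries, etc.)
CREATE ALGORITHM=MERGE VIEW Africa AS
    SELECT * FROM Country WHERE Continent = 'Africa';

-- Or force materialization into a temporary table:
CREATE ALGORITHM=TEMPTABLE VIEW Africa_tmp AS
    SELECT * FROM Country WHERE Continent = 'Africa';
```

Leaving the clause out (ALGORITHM=UNDEFINED) lets MySQL pick, which is usually MERGE when possible.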
Views can be defined as updatable, which allows you to update the underlying tables via the view. For example, we can do this:
UPDATE Africa SET Population = Population * 1.1 WHERE Name = 'Egypt';
Note that we added the WITH CHECK OPTION clause when we defined the Africa view, which means that MySQL will always check any modifications we make against the conditions defined in the view. For example, we can't update the Continent of any country via the view, because the view is defined with Continent = 'Africa'. So running this query:
UPDATE Africa SET Continent = 'Europe';
Will yield the following error:
ERROR 1369 (HY000): CHECK OPTION failed 'world.Africa'
How to Use Views to Improve Performance?
In my opinion, normal MySQL views are not very useful for improving performance, and the examples mentioned in the book are not very common and quite niche. One of the mentioned examples is practical, however: suppose we want to create a public view of some table that excludes certain columns (those that hold PII data, for example). Instead of implementing complex column privileges, we can just create a view that excludes those columns and grant access to that view only.
For example, let's create this view of the users table:
CREATE VIEW public.users AS
SELECT first_name, last_name -- but not social_security_number
FROM users;
Then we can run:
GRANT SELECT ON public.* TO public_user;
to grant a certain database user (public_user) SELECT access on those public views only.
MySQL doesn't support materialized views, which would have been really useful for improving performance because they execute complex queries once and are updated periodically in the background, so SELECTs on the materialized views are much faster. They can, however, be emulated using the summary and cache tables that we discussed previously.
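A minimal sketch of that emulation, with all table and column names made up for illustration: keep a summary table, rebuild it periodically, and swap it in atomically with RENAME TABLE so readers never see a half-built copy.

```sql
-- A summary table standing in for a materialized view.
CREATE TABLE sales_per_continent (
    continent   VARCHAR(50) NOT NULL PRIMARY KEY,
    total_sales BIGINT      NOT NULL
) ENGINE=InnoDB;

-- "Refresh" (run from an event or cronjob): rebuild a shadow
-- copy, then swap it in with an atomic rename.
CREATE TABLE sales_per_continent_new LIKE sales_per_continent;
INSERT INTO sales_per_continent_new
    SELECT c.Continent, SUM(s.amount)
    FROM sales s
    JOIN Country c ON c.Code = s.country_code
    GROUP BY c.Continent;
RENAME TABLE sales_per_continent     TO sales_per_continent_old,
             sales_per_continent_new TO sales_per_continent;
DROP TABLE sales_per_continent_old;
```

The RENAME TABLE step is what makes this behave like a refresh: queries either see the old snapshot or the new one, never a partially populated table.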
Triggers let you execute code when there’s an INSERT, UPDATE, or DELETE statement. You can direct MySQL to activate triggers before and/or after the triggering statement executes.
Triggers can simplify application logic and improve performance, because they save round-trips between the client and the server. They can also be helpful for automatically updating denormalized and summary tables.
Triggers in InnoDB are also executed in the same transaction as the original query, so they are atomic with it and can be used to perform atomic operations.
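As a sketch of the summary-table use case (both tables and columns here are hypothetical), a trigger can keep a running total up to date inside the same transaction as the original INSERT:

```sql
-- Hypothetical schema: each inserted sale bumps a
-- per-continent running total. If the INSERT's transaction
-- rolls back, so does the summary update.
CREATE TRIGGER trg_sales_totals
AFTER INSERT ON sales
FOR EACH ROW
    UPDATE sales_totals
    SET total_sales = total_sales + NEW.amount
    WHERE continent = NEW.continent;
```

Note the FOR EACH ROW in the definition: as discussed below, this is the only semantic MySQL supports, so the trigger body runs once per inserted row.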
While triggers seem useful in some cases, they introduce some problems:
- In MySQL 5.6, you could only have 1 trigger per event per table, so you couldn't have 2 AFTER INSERT triggers on the same table, for example. This limitation no longer exists in the current version (v8).
- Triggers in MySQL are always defined with per-row semantics, so the code you define in the trigger is executed FOR EACH ROW, which can be very inefficient when working with large datasets.
- Triggers obscure what your server is really doing because a simple statement can make the server perform a lot of invisible work.
- Triggers can be hard to debug and it’s often difficult to analyze performance bottlenecks when triggers are involved.
- Triggers can cause non-obvious deadlocks and lock waits. If a trigger fails the original query will fail and if you’re not aware the trigger exists, it can be hard to decipher the error code.
Events allow you to execute periodic statements, in a form very similar to what a cronjob would do. You can create events that execute SQL code once at a specific time, or repeatedly at a specified interval.
For example, let's say we wish to run OPTIMIZE TABLE once a week. We can define an event that does so as follows:
CREATE EVENT optimize_somedb ON SCHEDULE EVERY 1 WEEK
DO CALL optimize_tables('somedb');
The common practice when defining events is to wrap their logic in a stored procedure and have the event only call that stored procedure, like the procedure optimize_tables in the previous code.
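A sketch of what such a procedure might look like (this is my simplified stand-in, not the book's version, and the sales table name is made up; a real one would typically iterate over all tables in the database):

```sql
DELIMITER //
CREATE PROCEDURE optimize_tables(db_name VARCHAR(64))
BEGIN
    -- Build the statement dynamically so the database name
    -- can be passed in as a parameter, then run it via a
    -- prepared statement.
    SET @sql := CONCAT('OPTIMIZE TABLE ', db_name, '.sales');
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END //
DELIMITER ;
```

Keeping the logic in a procedure like this means you can test and change it independently, while the event definition stays a one-liner.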
Note that unlike many cronjob handlers, MySQL doesn't protect against 2 runs of an event overlapping (for example, an event takes so long that the next run is scheduled while the previous run is still executing), so you will have to protect against that yourself. You can do so by using locks as follows:
CREATE EVENT optimize_somedb ON SCHEDULE EVERY 1 WEEK
DO
BEGIN
    DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
        BEGIN END;
    IF GET_LOCK('somedb', 0) THEN
        DO CALL optimize_tables('somedb');
    END IF;
    DO RELEASE_LOCK('somedb');
END
MySQL now offers a lot of features for doing many different things; some of them are useful for improving performance and others are useful for simplifying application logic. In this article we went over some of those features, but there are a lot more that you can always check in the MySQL docs.
Why is this the Last Chapter in this Series?
As mentioned in the article intro, this chapter will be the last one I add to this series. In my opinion, the first 7 chapters, which we have now discussed, give you a very good idea of database internals, design techniques, performance optimizations, and advanced features.
The remainder of the book discusses further topics that might not come in handy for most readers any time soon. It covers topics like optimizing server settings, operating system and hardware, backup and recovery, replication, availability, etc.
While those topics are useful, they are mostly out of your control as a software engineer; they are more operational and are nowadays handled by cloud providers, so you may never have to deal with them at all.
Be sure to check out my series on Amazon Aurora, which is one of the most famous managed MySQL products.
I hope this series was useful!
Check out previous articles in this series here:
- [Chapter 1] MySQL Architecture and History
- [Chapter 2] Benchmarking MySQL
- [Chapter 3] Profiling Server Performance
- [Chapter 4] Optimizing Schema and Data Types
- [Chapter 5] Indexing for High Performance
- [Chapter 6][Part 1] Query Performance Optimization
- [Chapter 6][Part 2] Query Performance Optimization