SQL - Is Your Sort Order Deterministic? A classic example of S.E.P. (Somebody Else’s Problem)
A Forgettable Rule
One of the first things you forget about SQL databases is that an un-ordered query can return results in any order, i.e. a non-deterministic sort order. Moreover, the order is not guaranteed each time you execute the query.
Here is an example of a query with a non-deterministic sort order:
SELECT * FROM bookshelf
If you run this query multiple times, you could theoretically get results with a different order each time. In reality, you will most likely get something that looks like it is implicitly ordered, and you will get it consistently every time you run your query:
+------------------------------------------------+------------+
| id | title | author |
+------------------------------------------------+------------+
| 1 | The Hitchhiker's Guide to the Galaxy | Adams |
| 2 | The Restaurant at the End of the Universe | Adams |
| 3 | Life, the Universe and Everything | Adams |
| 4 | So Long, and Thanks for All the Fish | Adams |
| 5 | Mostly Harmless | Adams |
| 6 | Dirk Gently's Holistic Detective Agency | Adams |
+------------------------------------------------+------------+
The only reason you usually get a consistent sort order is that the database usually executes the query in the same way. It's so consistent, in fact, that eventually most developers forget that you can't actually rely on this order. Unfortunately, when developers forget this rule, they make designs that depend on a non-deterministic sort order when they shouldn't. (This is foreshadowing, in case you didn't notice)
To be fair, most developers will use the <ORDER BY> command whenever sorting is needed. However, even when you use the <ORDER BY> command, the resulting order can be still be non-deterministic if you order by a column that contains non-distinct values.
This query has a deterministic sort order because it orders by the primary key, "id":
SELECT * FROM bookshelf ORDER BY id ASC
This query has a non-deterministic sort order because it orders by a column with many non-distinct values:
SELECT * FROM bookshelf ORDER BY author ASC
Again, these queries will usually end up with the same results in the same sort order. It is a coincidence. Don't depend on it. Please.
Maybe. Who cares?
-- Slartibartfast, "The Hitchhiker's Guide to the Galaxy"
A Flawed Design
A few years ago, my team inherited a legacy system with a large database. One of the features of this system is that it needs to present, to the user, tables of data. These tables are typically paginated in order to organize hundreds of items. As a nice usability feature, the tables allow the user to filter or sort on any column.
For performance reasons, the handling of the pagination, filtering, and sorting was ultimately happening at the database level:
- Pagination: Use the <LIMIT> command to control number of results and the index at which to start the result set
- Sorting: Use the <ORDER BY> command to select which column to sort by
In order to give the user full flexibility, these queries were generated dynamically. For example, if the user sorts by author and clicks on the second page, this dynamically creates the following query:
SELECT id, title, author FROM bookshelf
ORDER BY author ASC LIMIT 5,5 # Start at index 5, total 5 results
This approach cleverly allowed the pagination and sorting implementation to be re-used by hundreds of different tables. It was easy to churn out tables where the user was able to sort on any column of their choice. And the accompanying database queries were being crafted dynamically in the backend. But here's the question - were all these dynamically created queries having a deterministic sort order? (This is a rhetorical question)
The answer of course, is no. If any user decided to sort on a column with non-distinct values, this design would have a problem. The user would navigate from page to page, and because the query is not guaranteed to return the same sort order, the user might see items repeated from page to page, or never see some items at all.
However, the original designers never had to see these potential problems. The database just so happened to deliver consistent results every time. Why not rely on it permanently? They saw no need to rectify these queries with a non-deterministic sort order because it was clearly...
Somebody Else's Problem
An SEP is something we can't see, or don't see, or our brain doesn't let us see, because we think that it's somebody else's problem. That’s what SEP means. Somebody Else’s Problem. The brain just edits it out, it's like a blind spot.
-- Ford Prefect, "Life, the Universe and Everything"
The database was on a MySQL 5.0 server. My team was tasked with upgrading the server to MySQL 8.0. After we did the upgrade, we learned very quickly, that queries with non-deterministic sort order were a problem, and that we were the "somebody else" the problem belonged to.
Let's revisit our previous example where the user sorts by author, and there are 5 books per page.
This is what we saw on Page 1:
SELECT id, title, author FROM bookshelf
ORDER BY author ASC LIMIT 0,5 # Start at index 0, total 5 results
+------------------------------------------------+------------+
| id | title | author |
+------------------------------------------------+------------+
| 1 | The Hitchhiker's Guide to the Galaxy | Adams |
| 2 | The Restaurant at the End of the Universe | Adams |
| 3 | Life, the Universe and Everything | Adams |
| 4 | So Long, and Thanks for All the Fish | Adams |
| 5 | Mostly Harmless | Adams |
+------------------------------------------------+------------+
And this is what we saw on Page 2:
SELECT id, title, author FROM bookshelf
ORDER BY author ASC LIMIT 5,5 # Start at index 5, total 5 results
+------------------------------------------------+------------+
| id | title | author |
+------------------------------------------------+------------+
| 2 | The Restaurant at the End of the Universe | Adams |
+------------------------------------------------+------------+
Hang on, what happened to book 6, "Dirk Gently's Holistic Detective Agency"? And why are we seeing book 2 again?
It took some time for the truth to sink in: MySQL 8.0 is under no obligation to return the same result sequence as MySQL 5.0 if you are running a query with a non-deterministic sort order. As the errors started piling up, we grumbled angrily that MySQL shouldn't have changed things, and of course took another opportunity to rail at our predecessors. Luckily, cooler heads eventually prevailed. MySQL 8.0 is concerned with optimizing queries and providing accurately sorted results for queries with deterministic sort orders. Therefore the right way forward was to change our queries to use a deterministic sort order.
Our Solution
When we discovered our problem, we saw that it only happened for tables that had many, many columns. As these columns were hidden from the users, it was tempting to consider paring down the number of columns and to again rely on the good graces of the SQL database to provide consistent results for non-deterministic sort orders.
A learning experience is one of those things that says, 'You know that thing you just did? Don't do that.
― Douglas Adams, The Salmon of Doubt
We decided to do it the right way instead. We needed to continue to allow users to sort on any number of columns containing non-distinct values. So how were we going to change those queries to have deterministic sort orders?
We decided to take advantage of the fact that almost all of the tables had a primary key called "id". We then appended it to any query that was constructed with the <ORDER BY> command.
Our troubled Page 2 query now became:
SELECT id, title, author FROM bookshelf
ORDER BY author ASC, id ASC LIMIT 5,5 # Start at index 5, total 5 results
+------------------------------------------------+------------+
| id | title | author |
+------------------------------------------------+------------+
| 6 | Dirk Gently's Holistic Detective Agency | Adams |
+------------------------------------------------+------------+
And our missing book was back!
We had about a dozen tables that didn't have an "id" column, due to the fact that they were the result of complicated joins. It was a fairly straightforward exercise to update those tables to use an "id" column from one of its joined tables and this solved the remaining tables.
If you've gotten to the end of this article, I hope it may have shed light on your particular S.E.P. May all your future solutions arise with ease.
Solutions nearly always come from the direction you least expect, which means there's no point trying to look in that direction because it won't be coming from there.
― Douglas Adams, The Salmon of Doubt
If you have a problem and no one else can help, maybe you can hire the Kalvad-Team.