SQL Optimisation 101: How the basics can improve query speed drastically

Looking for a way to optimise SQL queries for speed? Sometimes, the answer lies in the basics, and you don't have to know complex techniques.

Dear Diary,

A recent event at work took me back to my beginner days with SQL. I was then a bumbling STEM student who grew starry-eyed as the output of his first SELECT * query materialised. At that point, being able to fetch the details of all employees with a salary higher than ten thousand felt like having access to an arcane science of magic and wizardry. Fast forward six years, and the same guy builds intricate queries enmeshing highly normalised tables, with CTEs and subqueries, window functions that leap unbounded, making ranks both smart and dense, transposing where needed, pivoting where not, doing it all at the speed of light to solve some of the most complex business requirements. When you're given shiny new toys, the old ones are often forgotten, and in much the same way, whenever I needed to optimise queries, I turned to higher-level methods. But a lot of the time, the simplest solution is the best way forward, as I was soon to learn.

The business problem

In one of our engagements with a major client (a key player in the industry, with a complete monopoly in some regions), our team had built an ML model to predict which of the client's customers were expected to default on their bill payments in the upcoming month. This was a very useful tool for the client: it not only helped them take measures in advance to pre-empt customers from defaulting, but also gave them a preview of how much debt was likely to be owed to the company and how long it would take to recover those losses, so they could strategise their income and expense lines accordingly.

One of the important KPIs for the client here is the default rate, i.e. the percentage of their customer base expected to default (predicted defaulters divided by total invoiced customers, times 100). The business needed to keep this figure in check. And we, as business analysts, had to provide them with it.

The technical setting

The task was simple enough. The data was distributed across two tables in our analytics warehouse:

  1. FACT_DEFAULT_CUSTOMERS: a fact table containing the monthly invoicing details of each customer, with a flag column indicating whether the customer is predicted to default on that invoice, and

  2. DIM_CUSTOMER: a dimension table containing customer attributes, which are required to uniquely identify each customer

This is a standard architecture for most OLAP systems. For the upcoming month, we needed to calculate the count of predicted defaulters over the total number of invoiced customers.
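
For context, here is roughly what the two tables look like. This DDL is not from the actual project; the column names are inferred from the queries further down and the data types are assumptions, so treat it purely as an illustrative sketch.

CREATE TABLE DIM_CUSTOMER (
    CUSTOMER_WID   BIGINT PRIMARY KEY,   -- surrogate key used in joins
    CUSTOMER_ID    VARCHAR(50)           -- business identifier of the customer
    -- ... other customer attributes
);

CREATE TABLE FACT_DEFAULT_CUSTOMERS (
    CUSTOMER_WID            BIGINT,      -- references DIM_CUSTOMER
    BILL_DEFAULTER_DATE_WID VARCHAR(8),  -- date key in YYYYMMDD form, e.g. '20240601'
    PREDICTED_DEFAULTER     SMALLINT     -- 1 if the model predicts a default, else 0
    -- ... other invoicing measures
);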

The approaches

I was working on this along with a colleague of mine and we each wrote our own query.

Query - 1

When my colleague showed me his query, I was surprised by the difference in perspective. This was what he came up with:

SELECT 
    COUNT(DISTINCT DEFAULT_CUSTOMERS)*100.0
        /COUNT(DISTINCT TOTAL_CUSTOMERS)
FROM (
    SELECT
        CASE WHEN LEFT(BILL_DEFAULTER_DATE_WID,6)='202406' 
            THEN CUSTOMER_ID 
        END AS TOTAL_CUSTOMERS,
        CASE WHEN LEFT(BILL_DEFAULTER_DATE_WID,6)='202406' 
            AND PREDICTED_DEFAULTER=1
            THEN CUSTOMER_ID
        END AS DEFAULT_CUSTOMERS
    FROM FACT_DEFAULT_CUSTOMERS A
    LEFT JOIN DIM_CUSTOMER B ON A.CUSTOMER_WID = B.CUSTOMER_WID
) A

This query essentially adds two derived columns to the fact. For the first column, if an invoice belongs to the month of June, we grab the customer ID. For the second column, we only pick the customer ID if the invoice is for June and the customer is predicted to default. The first column thus gives you the total customers invoiced in June, and the second gives you the predicted defaulters for June. What remains is to divide the two to calculate the percentage. Sound logic.

Here's the query output.

Notice that it takes about a minute to run. That's quite fast. But could it be better?

Query - 2

I approached the problem only slightly differently.

SELECT
    PREDICTED_DEFAULT_COUNT*100.0/TOTAL_CUSTOMERS AS DEFAULT_RATE
FROM (
    SELECT
        COUNT(DISTINCT CUSTOMER_ID) AS TOTAL_CUSTOMERS
        ,COUNT(DISTINCT CASE WHEN PREDICTED_DEFAULTER=1 THEN CUSTOMER_ID END) AS PREDICTED_DEFAULT_COUNT
    FROM FACT_DEFAULT_CUSTOMERS A
    LEFT JOIN DIM_CUSTOMER B ON A.CUSTOMER_WID = B.CUSTOMER_WID
    WHERE BILL_DEFAULTER_DATE_WID >= '20240601'
) FACT

And here's the output of my query.

Woah! What just happened? It dropped from 1 minute to 4 seconds?! Even though both our queries were roughly the same, the significant difference in the runtimes is due to our trusty old pal, the WHERE clause.

The explanation

This feels like a good place to hark back to a previous statement I made. The client we were working for is a major player in the industry. Naturally, the tables we're dealing with are very large, with millions of records in each.

What is happening in the background

A good starting point for optimising queries is the 'query execution plan'. This is a handy little tool, available in pretty much any SQL editor, that shows how the database engine is actually executing the query behind the scenes. Let's look at the execution plans for both queries (read each plan from right to left).

Query 1 Plan

Query 2 Plan

This is only a part of each query plan, but it's enough to tell you that the first query is processing all 19 million rows, while the second query is processing only a handful of records.
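
If you've never pulled up an execution plan before, most engines expose one with very little ceremony. The exact syntax depends on your database; the snippet below assumes a SQL Server-style engine (SSMS also offers a graphical 'Include Actual Execution Plan' option), while PostgreSQL, MySQL and Snowflake use an EXPLAIN prefix instead.

-- SQL Server style (assumption): return the estimated plan as text
-- instead of actually running the statement.
SET SHOWPLAN_ALL ON;
GO

SELECT COUNT(*)
FROM FACT_DEFAULT_CUSTOMERS
WHERE BILL_DEFAULTER_DATE_WID >= '20240601';
GO

SET SHOWPLAN_ALL OFF;
GO

-- PostgreSQL / MySQL / Snowflake equivalent: prefix the query with EXPLAIN.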

SQL query execution order

All SQL queries are evaluated in a particular logical order. It is roughly as follows:

  1. FROM

  2. JOIN

  3. WHERE

  4. GROUP BY

  5. HAVING

  6. ORDER BY

This list is by no means comprehensive, but it's good enough for our purpose. Since the WHERE clause is applied right after the tables are read and joined, before any aggregation happens, the second query only takes a subset of the whole table and processes it. This makes sense: we only need the default rate for the upcoming month, so there is no need to scan the whole table. And just like that, we managed to improve the runtime by a factor of 15 (from about 60 seconds down to 4). Unbelievable, isn't it?
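
To make the difference concrete, here is Query 1 restated with the month condition pushed down into a WHERE clause. This is just an illustrative rewrite (Query 2 above is what we actually ran); the point is that nothing changes except where the filter sits, yet the aggregation now only ever sees June's rows.

SELECT
    COUNT(DISTINCT DEFAULT_CUSTOMERS)*100.0
        /COUNT(DISTINCT TOTAL_CUSTOMERS) AS DEFAULT_RATE
FROM (
    SELECT
        CUSTOMER_ID AS TOTAL_CUSTOMERS,
        CASE WHEN PREDICTED_DEFAULTER=1
            THEN CUSTOMER_ID
        END AS DEFAULT_CUSTOMERS
    FROM FACT_DEFAULT_CUSTOMERS A
    LEFT JOIN DIM_CUSTOMER B ON A.CUSTOMER_WID = B.CUSTOMER_WID
    -- the filter is applied while reading the fact, before any counting happens
    WHERE LEFT(BILL_DEFAULTER_DATE_WID, 6) = '202406'
) A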

The conclusion

Sometimes, you just have to go back to the basics. There were a number of other ways the query could have been optimised, including creating indices or partitioning the tables, but none of them would have been as straightforward as this solution. As they say, don't scratch your left ear with your right hand.
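
For completeness, here is what one of those heavier options might look like: an index on the date key of the fact. The index name and the INCLUDE syntax are assumptions (SQL Server flavoured), and whether it helps at all depends on the engine, the data and how the filter is written, so read it as a sketch rather than a recommendation.

-- Hypothetical index (name and column list are assumptions): lets the engine
-- seek straight to one month's rows rather than scanning the whole fact,
-- provided the WHERE clause filters on the raw column as in Query 2.
CREATE INDEX IX_FACT_DEFAULT_BILL_DATE
    ON FACT_DEFAULT_CUSTOMERS (BILL_DEFAULTER_DATE_WID)
    INCLUDE (CUSTOMER_WID, PREDICTED_DEFAULTER);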

I look forward to using this handy tool and starting small when optimising queries from now on, before bringing out the big guns. This experience has taught me to approach query optimisation more methodically, and I anticipate benefiting greatly from it.

Shreyan