SQL: SQs & CTEs

Subqueries and Common Table Expressions are two ways of writing more complex queries in SQL. Both allow for the defining of an intermediate result set which can be used within a larger query. In simpler scenarios, both can be used interchangeably. This blog focuses on the similarities and differences between the two techniques and when to pick one over the other. More advanced techniques relating to each, like correlated subqueries and recursive CTEs, will be covered in a future blog.

Similarities

● Both SQs and CTEs can be used when queries have more than one stage, like needing to compute an aggregate and compare it to row-level values. They can be used to create an intermediate result set which is used inside a query.

● Both SQs and CTEs have temporary lifetime. They exist for the duration it takes for the query that they are defined in to be executed. Once the query finishes running, the result of a CTE or SQ disappears.

This makes them different from tables, temporary tables and views, which persist beyond a single SQL statement.

● Both have limited scope. A CTE can only be referenced by the statement immediately following the WITH clause. Subqueries are more restricted in scope, and are only local to the clause or query block in which they appear. If the same subquery is needed in other parts of the query, it usually needs to be repeated.

Example

Using the F1 schema, we want to find out which drivers have scored more points than that of the average driver.

This will have two stages: (1) calculate the total points per driver and (2) calculate the driver-wide average and compare each driver to it. The average is not being computed over the results table, but over the derived driver-level table.

Using a Subquery

A subquery is a query within another query. It is placed in the clause where its result is needed.

Below is one way to answer the driver points question using multiple subqueries:

SELECT
    dp.forename,
    dp.surname,
    dp.total_points
FROM (
    -- start of subquery: calculate total points per driver
    SELECT
        d.driverid,
        d.forename,
        d.surname,
        SUM(r.points) AS total_points
    FROM results r
    JOIN drivers d
        ON r.driverid = d.driverid
    GROUP BY
        d.driverid,
        d.forename,
        d.surname
    -- end of subquery: calculate total points per driver
) dp
WHERE dp.total_points > (
    -- start of subquery: calculate the average driver total
    SELECT AVG(driver_total)
    FROM (
        -- start of nested subquery: calculate total points per driver
        SELECT
            driverid,
            SUM(points) AS driver_total
        FROM results
        GROUP BY driverid
        -- end of nested subquery: calculate total points per driver
    ) avg_points
    -- end of subquery: calculate the average driver total
)
ORDER BY dp.total_points DESC;

The first subquery creates a driver-level table with total points per driver. The second calculates the average of the driver-level table.

Even in this relatively straightforward example, we have already arrived at a weakness of subqueries: needing to repeat similar or the same logic, resulting in poor readability.

Using a CTE

CTEs let us perform the same logic but do so in a more staged way. CTEs are defined at the top of a query using the keyword WITH and are aliased, or given a name. The alias can be used multiple times in the main query that follows the CTE. Multiple CTEs are separated by commas.

WITH driver_points AS (
    -- start of CTE: calculate total points per driver
    SELECT
        d.driverid,
        d.forename,
        d.surname,
        SUM(r.points) AS total_points
    FROM results r
    JOIN drivers d
        ON r.driverid = d.driverid
    GROUP BY
        d.driverid,
        d.forename,
        d.surname
    -- end of CTE: calculate total points per driver
),

average_driver_points AS (
    -- start of CTE: calculate the average driver total
    SELECT
        AVG(total_points) AS avg_total_points
    FROM driver_points
    -- end of CTE: calculate the average driver total
)

SELECT
    dp.forename,
    dp.surname,
    dp.total_points
FROM driver_points dp
CROSS JOIN average_driver_points adp
WHERE dp.total_points > adp.avg_total_points
ORDER BY dp.total_points DESC;

The code block achieves the same result as the subquery version, but is much easier to read. The main advantage of CTEs is that they allow us to name each intermediate result and repeatedly call on that intermediate result without needing to rewrite any code.

Differences

● Syntax: subqueries are nested inside other queries, whilst CTEs are declared before the main query using a WITH clause.

● Readability: subqueries are fine when they are short and the logic is easy to follow. But once they become nested, they can be hard to read. A reader must understand them inside out.

CTEs are read more naturally. Each is named and the final query is written after the individual CTEs have been defined.

● Reusability (scope): Despite being able to assign aliases to subqueries, you cannot generally reuse the code block by reference to the alias: the entire subquery must be rewritten.

A CTE, by contrast, can be reused indefinitely within the query in which it has been defined.

When to use one over the other

A rough set of rules:

Use a subquery when the logic is small and self-contained
Use a CTE when the query has stages
Use a CTE when the same intermediate result is needed more than once
Use a CTE when readability matters more than keeping the query short

Author:

Shivam Wadhia

View Profile