SQL Performance Explained


Index for equality first, then for ranges

Given a table of employees and this query:

SELECT first_name, last_name, date_of_birth
FROM employees
WHERE date_of_birth >= ?
  AND date_of_birth <= ?
  AND subsidiary_id = ?

An index on (subsidiary_id, date_of_birth) is better than one on (date_of_birth, subsidiary_id) because it visits fewer index leaf nodes: with the equality column first, all matching entries sit next to each other in the index. The difference is negligible if the date filter is very selective; otherwise, the larger the date range, the worse the performance.
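A minimal sketch of the better index (standard SQL syntax; the index name is made up):

-- Equality column first, range column second: all matching entries
-- form one contiguous slice of the index leaf nodes.
CREATE INDEX emp_sub_dob ON employees (subsidiary_id, date_of_birth);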


Use longer prefixes in LIKE filters

When using a LIKE filter on an indexed column, only the prefix before the wildcard (%) is used as an index condition; the remaining characters don't narrow the scanned index range. The longer the prefix, the more efficient the lookup.

Consequently, a LIKE filter with a leading wildcard (e.g. "%ant") can't use an index and has to do a sequential scan over the whole table.
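For example, assuming an index on a last_name column (names are illustrative):

-- Can use the index: only entries starting with 'WIN' are scanned.
SELECT * FROM employees WHERE last_name LIKE 'WIN%';

-- Cannot use the index: the leading wildcard forces a full table scan.
SELECT * FROM employees WHERE last_name LIKE '%AND%';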


How to debug a slow query that uses an index

In general, queries are slow because they fetch too much data from the heap. To find the root cause, get the query's execution plan and look for:

  • Heap-table accesses → the query is reading too much data from the heap
  • Wide index range scans → the index scan is not selective, so it yields many candidate rows, which must be fetched from the heap to check whether they actually match
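In PostgreSQL, for instance, the plan with runtime statistics can be obtained like this (the query itself is illustrative):

EXPLAIN (ANALYZE, BUFFERS)
SELECT first_name, last_name
FROM employees
WHERE subsidiary_id = 42;

In the output, large "Buffers: shared read" counts and high "Rows Removed by Filter" numbers are typical signs of the two problems above.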


How long it takes to read a table block

The smallest unit of data that a database can read is a table block or a table page, which is usually 8KB. Depending on where that block is located, reading it takes different amounts of time:

  • From memory: 2μs
  • From SSD: 100μs
  • From HDD: 10ms

There are roughly two orders of magnitude between consecutive tiers, which makes it important to keep the indexes loaded in memory.
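To see the effect at scale: a query that touches 1,000 randomly scattered blocks costs about 2ms from memory, 100ms from SSD, and 10s from HDD.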


Only two tables are joined at a time

If you need to join 5 tables, 2 are joined first, then their result is joined with another table, and so on. The order in which you write the joins doesn't matter, because the database evaluates the possible join orders and chooses the one it estimates to be most efficient.

Furthermore, depending on table statistics, available indexes, and system resources, the actual joining is done using one of these methods: nested loops join, hash join, or merge join.

As a general guideline, to improve join performance, consider creating composite indexes that cover the columns in the WHERE clause, followed by the columns in the JOIN condition.
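A sketch of that guideline, on a hypothetical employees/sales schema (all names are made up): the filter column goes first, the join column second.

-- WHERE column (subsidiary_id) first, JOIN column (employee_id) second.
CREATE INDEX emp_filter_join ON employees (subsidiary_id, employee_id);

SELECT e.first_name, s.sale_date
FROM employees e
JOIN sales s ON s.employee_id = e.employee_id
WHERE e.subsidiary_id = 42;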


Nested loops join

How: It iterates over all rows of one table and, for each of them, searches for matching rows in the other table.

When: It's used for small datasets when the overhead of a more advanced join method is not justified, or when one of the tables has an index on the JOIN condition.

Tip: Ensure you have an index on the JOIN condition.

Complexity:

  • without any indexes: O(M * N) time. O(1) space
  • with indexes: O(M * log N) time. O(1) space
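Following the tip above, an index on the inner table's join column turns each inner lookup into an index probe (hypothetical names):

-- Without this index, the inner table is scanned once per outer row: O(M * N).
-- With it, each outer row costs one index descent instead: O(M * log N).
CREATE INDEX sales_employee_id ON sales (employee_id);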


Hash join

How: First, it builds a hash table from the smaller table, where the key is the JOIN key and the value is the necessary tuple data. Then, it scans the other table, hashing the JOIN key of each row and emitting the rows that are also present in the hash table.

When: It's used for joining large datasets with equality conditions, when the hash table built from the smaller input fits into the working memory (work_mem).

Tip: Ensure you have enough working memory to keep the hash table from spilling to disk.

Complexity:

  • O(M+N) time. O(M) space
  • O(M) to build the hash table + O(N) to scan the other table
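In PostgreSQL, for example, the working memory available to a hash or sort operation is controlled by the work_mem setting, which can be raised per session (the value below is illustrative):

-- Give this session more working memory so the hash table stays in RAM.
SET work_mem = '256MB';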


Merge join

How: Sorts both tables by the join condition and merges the results like a zipper. For very large datasets, it splits the dataset into chunks, sorts each chunk in memory, writes each sorted chunk into a temporary file on disk, and finally does a multi-way merge of the sorted files on disk.

When: It's used when at least one input dataset is pre-sorted or for datasets larger than the working memory.

Tip: Ensure datasets are pre-sorted or can be sorted efficiently.

Complexity:

  • for pre-sorted inputs: O(M+N) time. O(1) space
  • otherwise: O(M log M + N log N) time. O(M+N) space
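One way to provide pre-sorted inputs is an index on each side's join key, since B-tree indexes deliver rows in key order (hypothetical names):

CREATE INDEX employees_by_id ON employees (employee_id);
CREATE INDEX sales_by_employee ON sales (employee_id);
-- Both inputs can now be read in employee_id order straight from the
-- indexes, letting the merge start without an explicit sort step.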


Use indexes to pipeline ORDER BY clauses

The sorting operation of an ORDER BY is skipped if there's an index with the same ordering. Moreover, such an index allows the ORDER BY to be executed in a pipelined manner, i.e. if you request the top-N results, the query won't need to read and sort the entire result set first.

In contrast, non-pipelined ORDER BYs need to sort all rows, materialize the results, and then return the first N results. They take longer to execute as the table grows in size, whereas the pipelined ORDER BY is unaffected (it's affected by the depth of the index B-tree, but that grows much slower than the table size).
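A sketch of a pipelined top-N query (LIMIT syntax as in PostgreSQL; names are illustrative):

CREATE INDEX sales_by_date_desc ON sales (sale_date DESC);

-- Reads the first 10 index entries in order and stops; no full sort.
SELECT sale_id, sale_date
FROM sales
ORDER BY sale_date DESC
LIMIT 10;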


How a GROUP BY is executed

There are two ways: using a hash table or sorting and aggregating.

  1. Hash-based: the database builds a hash table keyed by the grouped columns, accumulating the aggregate as the value. The hash table is materialized in a temporary table and then returned to the client as the result.
  2. Sort-based: the database sorts the rows by the grouped columns, stores them in a temporary table, and then aggregates the results.

Method 1 uses less memory, because method 2 sorts the entire input set. However, method 2 can use an index to avoid the sorting step, in which case it can be pipelined, i.e. it can return results before reading the entire input set.
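A sketch of the pipelined, sort-based variant (hypothetical names):

CREATE INDEX sales_by_employee_id ON sales (employee_id);

-- If rows arrive in employee_id order via the index, each group can be
-- aggregated and emitted as soon as the next employee_id begins.
SELECT employee_id, SUM(amount) AS total
FROM sales
GROUP BY employee_id;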


Pagination methods

There are two ways to paginate results: offset and seek. The offset method uses the OFFSET keyword, whereas the seek method relies on a column filter to exclude the rows from the previous pages (in practice, this is usually the unique identifier of the last row from the previous page).

In terms of performance, the seek method is better because it doesn't have to process all previously paginated rows (as offset has to do) and is good for infinite scrolling use cases. However, it doesn't allow for page skipping, which can only be achieved with the offset method.
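The two methods side by side (LIMIT/OFFSET syntax as in PostgreSQL; names are illustrative):

-- Offset: fetches and discards the first 100 rows to return page 11.
SELECT sale_id, sale_date
FROM sales
ORDER BY sale_id
LIMIT 10 OFFSET 100;

-- Seek: starts right after the last row of the previous page.
SELECT sale_id, sale_date
FROM sales
WHERE sale_id > ?   -- last sale_id seen on the previous page
ORDER BY sale_id
LIMIT 10;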


IDEAS CURATED BY

ocpodariu

CURATOR'S NOTE

Tips to improve the performance of your SQL queries
