Sunday, April 30, 2023

Optimizing your SQL starts before you write any code


 Writing good SQL is more about how you think than what you write.  The problem that most of us have is that we think in terms of individual rows when we really need to think more in terms of sets of rows.   

 

 

Of course, this is natural for us to do since the end product is the individual rows, and we don’t necessarily think of them as a set.  But they are.  They are a set that satisfies the conditions that the business defines for the data they want. 

 

 

 We need to think this way always when writing and optimizing our SQL.  Too often we write code that is very much “row-by-row” (or as Tome Kyte would say “row-by-slow”).  The classic example is the correlated subquery in the SELECT list.  This will return a row for each row defined by the calling (outer) query.  The optimizer has a neat trick called subquery caching where it will cache values for “repeating” values coming into it.  That is great and the subquery can run less frequently.  

 

 

However, this isn’t always possible. If the values being passed in are relatively unique the saving by this can be minimal, and if totally unique it doesn’t help at all.  Also, it only can cache so many values (version dependent) and there is a problem of HASH collisions as it uses a hash algorithm to store the values in a hash table.   Bottom line, don’t count on this to save your performance. 

 

 

Better to not have correlated subqueries in the select list.  Of course, ones that run for a very few distinct values can be fine.  What I talking about is the ones that run thousands or millions of times.  Just making these run faster isn’t really solving the problem either, you’re just kicking the can down the road.  As the data volume increases (and it always does) this will eventually not be fast enough. 

 

 

I’m a huge fan of CTEs (Common Table Expressions).  I’ve written several posts about these.  In my experience CTEs can solve something like 80% of performance problems.   And for the correlated subquery they are the best solution. 

 

 

This is also thinking  in terms of sets of rows (values), not one at a time.  For that correlated subquery, how can I get all the possible values I would need for the subquery all at once and not one at a time?  The CTE is the perfect solution.   This isn’t the only way they can be used to help performance, but it is one of the most powerful.