Tuesday, December 22, 2020

Tracing in the cloud or not

 


Here is a link to a presentation I did on December 21st about Oracle Trace events 10046 and 10053.  Enjoy!

Click the link below, or copy and paste it into your browser.

https://www.youtube.com/watch?v=5x8pnMaepxM&feature=youtu.be
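If you want to play along at home, here is the classic syntax for turning these trace events on and off in your own session.  This is just the standard, widely used form, not an excerpt from the presentation:

-- 10046: extended SQL trace; level 12 captures bind values and wait events
alter session set events '10046 trace name context forever, level 12';

-- 10053: optimizer trace, written when a statement is hard parsed
alter session set events '10053 trace name context forever, level 1';

-- turn them back off
alter session set events '10046 trace name context off';
alter session set events '10053 trace name context off';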


And have some pie.  

Tuesday, December 15, 2020

Correlated Sub Queries in the cloud or not


Bottom line up front:  Correlated Sub Queries (or sub-selects) are really bad for performance.  

 

You’re writing a query, and in the result set you need the MAX(ORDER_DATE) based on the customer information in each row being returned.  

 

Ah, easy-peasy!  I’ll write a sub-select in the select list to do something like this:

 

custord.customer_number,
(select max(order_date)
   from customer_orders custord2
  where custord2.customer_id = custord.customer_number) LAST_ORDER_DATE,

 

Great! It works and life is good.  No need for a nasty “group by” thing.  Yea, I’m the man!

 

Or am I?  How often is that sub-select going to run?  Once for every row returned by the main query.  That could easily be thousands or millions of executions, depending on what you are doing.  This one is likely pretty harmless, especially if it’s the only one and it can do a unique lookup on an index; you might not even notice it’s a problem, as long as the data set is relatively small.  But it won’t scale well.

 

How about a 4- or 5-way table join inside the sub-select, or more?  I recently had one that was a 12-way table join.  (It turned out that at least 4 of the tables weren’t needed once we looked into it, but still, an 8-way join is a lot to do in a sub-select like this.)

 

Below is part of a SQL Monitor report for a query I recently worked on for a client.  It had 16 correlated sub-queries in the select list.  Each one ran rather fast each time, but each one was running 9,794 times, and the total run time was over 10 minutes.  To bring back only 9,794 rows, that seemed like a lot.  Notice that the number of rows returned and the number of times each one ran are the same; you can see this by comparing the EXECUTIONS column to the actual rows returned.  Notice that the first one didn’t return a row for each input row.  That was the nature of that sub-query; there was a pretty good chance it wouldn’t return a row each time.

 

This screen shot only shows a few of them.  Most were a 3- or 4-way table join.  I’ve hidden the full plan for each sub-query; the plans they used were fine.  It was the excessive number of executions that was the issue.
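If you want to pull a report like this yourself, the SQL Monitor output can be generated with DBMS_SQLTUNE (it does require the Tuning Pack license).  The sql_id below is a placeholder, not the client’s real one:

select dbms_sqltune.report_sql_monitor(
         sql_id       => 'your_sql_id_here',  -- placeholder
         type         => 'TEXT',
         report_level => 'ALL')
  from dual;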

 

A semi-side point: when looking at this output, notice the estimated rows is 1 (one) for each of the sub-queries.  That estimate is correct: for each run it returned at most 1 row.  The actual rows returned is higher because that is the total for all runs.  The estimate is a “per run” estimate, and it is correct.  This is an excellent example of why actual run time stats are necessary for performance optimization.  You might look at the explain plan and think that the one row is just fine.  In reality that is very misleading.
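An easy way to get those actual run time stats, if you can run the statement, is the GATHER_PLAN_STATISTICS hint plus DBMS_XPLAN.  Here’s a minimal sketch using the made-up tables from the example above (the customers table and its columns are my assumption):

select /*+ gather_plan_statistics */
       custord.customer_number,
       (select max(order_date)
          from customer_orders custord2
         where custord2.customer_id = custord.customer_number) last_order_date
  from customers custord;

-- in the same session, show the last cursor with estimated (E-Rows)
-- and actual (A-Rows) side by side; also check the Starts column
select * from table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));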

[screen shot: SQL Monitor output for several of the correlated sub-queries]

So, what to do?  I rewrote the query to have a CTE (Common Table Expression, using the WITH clause, also called a Sub-Query Factor) for each sub-query.  I was hoping to be able to combine some of them to reduce the count, but that wasn’t possible from what I could see of the queries and the data.  Maybe some could have been combined, but it wasn’t obvious from the code.

 

Now I joined each CTE into the main query just like any other table.  Most of the sub-queries were joined on 2 columns.  I joined them all in as LEFT OUTER JOINs, because I knew for sure the first one wouldn’t return a row for each row in the main query, and to be safe I did the same with all of them.  I was running out of time on this gig so I didn’t get to test with INNER JOINs, but I did let them know that they would likely be able to convert most of the joins to INNER.
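Here’s a minimal sketch of the rewrite pattern, using the made-up customer_orders example from the top of this post (the customers table is my stand-in for the main query, not the client’s actual code):

with last_orders as (
    -- aggregate once for all customers, instead of once per row returned
    select custord2.customer_id,
           max(custord2.order_date) last_order_date
      from customer_orders custord2
     group by custord2.customer_id
)
select custord.customer_number,
       lo.last_order_date
  from customers custord
  left outer join last_orders lo
    on lo.customer_id = custord.customer_number;

The LEFT OUTER JOIN keeps the same behavior as the scalar sub-select, which returns a NULL when it finds no match.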

 

This query was running in 33 seconds after this change.  From over 10 minutes to 33 seconds; yea, I call that a win.  And with this change the run time will only creep up slowly as the data grows.  With the original, it goes up sharply as the number of rows returned by the main query goes up.  In short, the new code is scalable, the old code was not.

 

Key point: I used the MATERIALIZE hint on all the CTEs.  Each CTE was only referenced once in the query, so I didn’t want it to be merged/unnested into the main query.  I find that the MATERIALIZE hint should be used nearly all the time when creating CTEs.  There have been times that I didn’t need it because the optimizer materialized the CTE anyway (typically when you use it more than once), and a few times when merging it into the main query was a good idea.  Test both ways when you are working with CTEs.
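The hint goes inside the CTE itself.  Continuing the sketch from above:

with last_orders as (
    select /*+ MATERIALIZE */  -- force the CTE into its own temp table
           custord2.customer_id,
           max(custord2.order_date) last_order_date
      from customer_orders custord2
     group by custord2.customer_id
)
select custord.customer_number,
       lo.last_order_date
  from customers custord
  left outer join last_orders lo
    on lo.customer_id = custord.customer_number;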


What has this got to do with a fire in my fireplace?  Maybe it's that you're "burning CPU, LIOs, and time" when you do these correlated sub-queries.  Or maybe it's just that I like the picture.