Wednesday, August 26, 2020

Cardinality in the cloud or not

I’ve been teaching and doing SQL optimization for many years now. Given a statement that is 5 lines or 500 lines long, the reason an execution plan is not optimal really comes down to the same thing every time. 

The cardinality calculation is off. 

The key is that it should be in the same magnitude as the actual rows returned. Not the exact same value, that’s cool when it happens but not necessary. The value for cardinality is an estimated number of rows, so it’s expected to not be exact. But when it’s way off, either up or down, things go badly. This is when the optimizer picks to use an index, a full scan, a join method, or a join order that causes the plan to go horribly wrong. 

Recently I was at a company that was moving very large queries from another system to Oracle Exadata. These queries were large in just about every sense of the word. The queries were hundreds of lines long many times and some worked with billions of rows in as well. When the cardinality would go wrong everything after that step in the plan just got worse and worse. It wasn’t uncommon for us to fix that sort of thing and take a query from never finishing to completing in a few minutes or even seconds. 

What kinds of things did we do to fix this? Mostly it had to do with changing the code of the query. It wasn’t common that the stats were wrong. And even if they were all we could do was recommend to the DBA team that the stats appeared to be off and hope they would fix it. 

Mostly it was rewrites of the query. 

 One problem was the over use of OUTER joins. Sure, they are needed, but the folks writing these queries tended to use them all the time. This did a few things; one it more or less forced a join order and it didn’t help the cardinality estimate many times. I fixed many queries but changing OUTER joins to INNER joins. The query got the same result set, of course if it didn’t that isn’t a fix! And could run many magnitudes faster because it would get a better join order and/or get a better cardinality. 

Another problem we banged into was a bug with hybrid histograms. This would kick in mostly with very large sets of data. The histogram would lead the optimizer to think it was getting back a very large set of rows many times and very poor decisions would be made because of that. The customer was still on version 12.1 and I’m pretty sure that bug has been fixed. 

 If you notice that columns with a hybrid histogram is giving very bad cardinality estimates you might want change that. How? Of course, you could drop the histogram (really you can’t, you just recollect stats without creating the histogram on that column), we couldn’t do that. We would typically modify that column in the predicate so it couldn’t use the histogram. Mostly we used NVL or COALESCE on the column. I prefer COALESCE myself because it will work on a column even if it’s a primary key, NVL wouldn’t. 

Sometimes it was expressing the predicate(s) in a different way. The simpler a predicate is the less likely a mistake will happen with the math. This typically requires really understanding the question being asked by the query. If you didn’t write the code, then you’ll likely need to chat with whoever did to find out what is going on. 

 Other times we had to resort to using the CARDINALITY hint to get the optimizer to do the right thing. I talked about this in my post "Hints in the cloud or not" this past March. You might want to check that post out too. It’s a quick read. 

The key with optimizing a query is getting that CARINALITY estimate to be in the right magnitude. Walk the plan “up and out” to find were it goes wrong. Stop there and fix that. Which table and predicates are driving the poor estimate? Really did into what that bit of code is doing, can it be rewritten another way? Sometimes one tiny change to a huge query and all is good. Other times it’s an iterative process, make a change, run it, make another and so on. 

Yes, it’s a lot harder to do this then to add more memory or CPU or other resource. But if you don’t fix it and just “add hardware” to make it go away, all you really did was kick the can down the road and it’s going to come back.