Tuesday, December 22, 2020

Tracing in the cloud or not

 


Here is a link to a presentation I did about Oracle Trace events 10046 and 10053, on December 21st.  Enjoy!

Click here or copy and paste the link below. 

https://www.youtube.com/watch?v=5x8pnMaepxM&feature=youtu.be


And have some pie.  

Tuesday, December 15, 2020

Correlated Sub Queries in the cloud or not


Bottom line up front:  Correlated Sub Queries (or sub-selects) are really bad for performance.  

 

You’re writing a query, and in the result set you need the MAX(ORDER_DATE) based on the customer information in each row being returned.  

 

Ah, easy-peasy!  I’ll write a sub-select in the select list to do something like this:

 

custord.customer_number,

(select max(order_date) from customer_orders custord2 

where custord2.customer_id = custord.customer_number) LAST_ORDER_DATE,

 

Great! It works and life is good.  No need for a nasty “group by” thing.  Yea, I’m the man!

 

Or am I?  How often is that sub-select going to run?  Once for every row returned by the main query.  That could easily be thousands or millions of times depending on what you are doing.  This one is likely pretty harmless, especially if it's the only one and it can do a unique lookup on an index; you might not even notice it's a problem as long as the data set is relatively small.  But it won't scale well.  

 

How about a 4- or 5-way table join in the sub-selects, or more?  I recently had one that was a 12-way table join.  (It turned out that at least 4 of the tables weren't needed once we looked into it, but still, an 8-way join is a lot to do in a sub-select like this.) 

 

Below is part of a SQL Monitor plan of a query I recently worked on for a client; it had 16 correlated sub-queries in the select list.  Each one ran rather fast each time, but each one was running 9,794 times, and the total run time was over 10 minutes.  To bring back only 9,794 rows, that seemed like a lot.  Notice that the number of rows returned and the number of times each one ran are the same; you can see this by comparing the EXECUTIONS column to the actual rows returned.  Also notice that the first one didn't return a row for each input row.  That was the nature of that sub-query; there was a pretty good chance it wouldn't return a row each time.  

 

This screen shot only shows a few of them.  Most were a 3- or 4-way table join.  I've hidden the full plan for each sub-query; the plans they used were fine.  It was the excessive runs that were the issue. 

 

A semi-side point: when looking at the output below, notice that the estimated rows is 1 (one) for each of the sub-queries.  That estimate is correct; for each run it returned at most 1 row.  The actual rows returned is higher because that is the total across all runs.  The estimate is a "per run" estimate, and it is correct.  This is an excellent example of why actual run-time stats are necessary for performance optimization.  You might look at the explain plan and think that the one row is just fine; in reality that is very misleading.  
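
As an aside, if you want to pull a SQL Monitor report like this for one of your own statements, one way is with DBMS_SQLTUNE.  This is just a sketch; it assumes you have the Tuning Pack licensed, and the sql_id shown is a placeholder to swap for your own:

  -- Pull a SQL Monitor report for a given statement
  SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
           sql_id       => '0abc123xyz9d7',  -- placeholder, use your statement's sql_id
           type         => 'TEXT',           -- 'ACTIVE' or 'HTML' also work
           report_level => 'ALL') AS report
    FROM dual;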





So, what to do?  I rewrote the query to have a CTE (Common Table Expression, using the WITH clause, also called a Sub-Query Factor) for each sub-query.  I was hoping to be able to combine some to reduce the number of them, but that wasn't possible from what I could see of the queries and the data.  Maybe some could have been, but it wasn't obvious from the code.

 

I then joined each CTE into the main query just like any other table.  Most of the sub-queries were joined on 2 columns, and I joined them all in as LEFT OUTER JOINs: I knew for sure the first one wouldn't return a row for every row in the main query, so to be safe I did that with all of them.  I was running out of time on this gig so I didn't get to test with INNER JOINs, but I did let them know that they would likely be able to convert most of the joins to INNER.  

 

This query was running in 33 seconds after the change.  From over 10 minutes to 33 seconds; yeah, I call that a win.  And with this change the run time will go up very slowly as the data grows.  For the original, it will go up sharply as the number of rows returned by the main query goes up.  In short, the new code is scalable; the old code was not. 

 

Key point: I used the MATERIALIZE hint on all the CTEs.  Each CTE was only referenced once in the query, so I didn't want it to be merged/unnested into the main query.  I find that the MATERIALIZE hint should be used nearly all the time when creating CTEs.  There have been times I didn't need it because the optimizer materialized the CTE anyway (typically when you use it more than once), and a few times when merging into the main query was a good idea.  Test both ways when you are working with CTEs. 
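
To make the shape of that rewrite concrete, here is a minimal sketch built on the made-up customer_orders example from the top of this post (the customers table and the join columns are assumptions, and the real query had 16 of these CTEs):

  WITH last_ord AS (
    SELECT /*+ MATERIALIZE */
           custord2.customer_id,
           MAX(custord2.order_date) AS last_order_date
      FROM customer_orders custord2
     GROUP BY custord2.customer_id
  )
  SELECT custord.customer_number,
         last_ord.last_order_date
    FROM customers custord
    LEFT OUTER JOIN last_ord
      ON last_ord.customer_id = custord.customer_number;

The aggregate is now computed once over the table instead of once per row of the main query, and the LEFT OUTER JOIN keeps the main-query rows that have no matching order.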


What has this got to do with a fire in my fireplace?  Maybe it's that you're "burning CPU, LIOs and time" when you do these correlated sub-queries.  Or maybe it's just that I like the picture.  

 

Monday, November 2, 2020

Defaults in the cloud or not


 Use the defaults for nearly all the parameters.  

 

I was working on a gig recently and they had the rather interesting situation where queries were running faster in production than in the test environment.  This had started happening fairly recently, so the question was, why?  

 

Upon investigation, the key thing was that the stats were different in the two systems.  I ran 10053 (optimizer trace) events on a sample query on both systems, and it was clear that the differences in stats made the two plans different.  The join orders were different, as was the use of indexes, and so on.  The stats weren't hugely different between the two systems, but (generally) the tables in TEST were slightly bigger than in PROD. 
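
For reference, the way I typically switch the optimizer trace on and off for a session looks like this (the statement of interest has to be hard parsed while the event is set, and the trace file shows up in the directory listed in V$DIAG_INFO):

  ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

  -- run (hard parse) the statement to be traced here

  ALTER SESSION SET EVENTS '10053 trace name context off';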

 

This was because they were testing new stuff in TEST.  It certainly makes sense to do this of course, but the issue was that since the two systems were different, the plans could be (and were) different.  And they were different enough that the test query I was provided ran in 6 seconds on PROD and 15+ minutes on TEST.  That's a pretty huge difference.  

 

My first stab at changing the query was to rewrite it with a couple of CTEs (Common Table Expressions, using the WITH clause).  I isolated two subqueries that were hierarchical, with the good old START WITH/CONNECT BY PRIOR syntax.  I noted that these were being evaluated in different parts of the plan on the two systems, so I thought maybe isolating them as CTEs would help.  And it did.  The plan was still a little different between the systems, but now both were running in about 6 seconds.  Win!  

 

But that wasn't really the root problem.  I kept digging around to see what else could be a cause, since telling them to rewrite all their code with the "right" CTEs would be difficult at best.  And this is where something in the 10053 trace came in really handy.  

 

Near the top of the 10053 trace file is a section on the parameters used by the optimizer.  This section alone can be very insightful about what is going on.  The section starts with this:

 

***************************************

PARAMETERS USED BY THE OPTIMIZER

********************************

 

And it has 2 parts, one starting with:

  *************************************

  PARAMETERS WITH ALTERED VALUES

  ******************************

 

The other with:

  *************************************

  PARAMETERS WITH DEFAULT VALUES

  ******************************

 

As the names suggest, they list parameters that have either changed values or are using defaults.  The list of parameters in the ALTERED VALUES part should be quite short; the fewer the better.  They didn't have a lot, which was very good.  But they did have two that really should be at their defaults today, on both PROD and TEST:

 

optimizer_index_cost_adj

optimizer_index_caching             

 

These were commonly changed back in version 10, when the default cost model went from IO to CPU.  In 9 the default cost model was IO; in 10 the default became CPU, which is what is used today.  Back then many shops had issues when first upgrading to 10 where indexes were not getting used appropriately.  Changing these parameters to non-defaults helped.  

 

But that time is over and the CPU model has been adjusted, so these parameters should be at their defaults.  The defaults are 100 for the cost adjust parameter and 0 for caching.  This means cost the plan as is: don't increase the cost of index access (a setting over 100) and don't discount it (a setting under 100).  And for the caching parameter, it means don't assume any of the index is cached (in the buffer cache). 
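
If you want to see what your own system has changed and put these two back to their defaults, something along these lines does it (a sketch, to be tested on a non-production system first; the SCOPE you need depends on how the instance is managed):

  -- Optimizer parameters that are not at their default values
  SELECT name, value
    FROM v$sys_optimizer_env
   WHERE isdefault = 'NO'
   ORDER BY name;

  -- Put the two index parameters back to their default values
  ALTER SYSTEM SET optimizer_index_cost_adj = 100 SCOPE=BOTH;
  ALTER SYSTEM SET optimizer_index_caching  = 0   SCOPE=BOTH;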

 

I tested the sample query with the following hints.  NOTE: Using the OPT_PARAM hint is great for testing; it should not be used for production code:

 

/*+ OPT_PARAM('optimizer_index_cost_adj' '100') OPT_PARAM('optimizer_index_caching' '0')   */

 

Running the original code with these hints on PROD and TEST, the query ran in about 6 seconds on each.  Better win!  The reason this is better is that it now appears they can get the performance they expect without changing/rewriting the code.  More testing will need to be done by the client to verify this, but the testing that was done while I was on the gig looked quite good. 

 

But why did this work?  I believe the differences in the stats were just enough to be exaggerated by the non-default parameters, producing different plans that caused poor performance.  With the defaults in place the plans were still very slightly different on the two systems, but not enough to cause a performance issue.   

 

The key to this story is: use the defaults, Luke (substitute your name for Luke).  Oracle does all of its internal testing with default settings for the parameters.  Once you venture into the land of non-defaults, you are in untested territory.  Clearly, to overcome a bug or a particular situation a non-default can be needed.  Once that situation or bug has been fixed, go back to the defaults. 

Wednesday, August 26, 2020

Cardinality in the cloud or not

I’ve been teaching and doing SQL optimization for many years now. Whether a statement is 5 lines or 500 lines long, the reason an execution plan is not optimal really comes down to the same thing every time. 

The cardinality calculation is off. 

The key is that it should be in the same magnitude as the actual rows returned. Not the exact same value; that's cool when it happens but not necessary. The value for cardinality is an estimated number of rows, so it's expected not to be exact. But when it's way off, either up or down, things go badly. This is when the optimizer picks an index, a full scan, a join method, or a join order that causes the plan to go horribly wrong. 
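
A handy way to check whether the estimate is in the right magnitude is to run the statement with row source statistics collected and compare the estimated and actual rows for each plan step. A sketch (the table and predicate are just stand-ins):

  -- Run the statement collecting per-step actual row counts
  SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*)
    FROM customer_orders
   WHERE order_date >= DATE '2020-01-01';

  -- Show the last plan with E-Rows (estimated) next to A-Rows (actual)
  SELECT *
    FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(FORMAT => 'ALLSTATS LAST'));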

Recently I was at a company that was moving very large queries from another system to Oracle Exadata. These queries were large in just about every sense of the word. The queries were often hundreds of lines long and some worked with billions of rows as well. When the cardinality estimate went wrong, everything after that step in the plan just got worse and worse. It wasn't uncommon for us to fix that sort of thing and take a query from never finishing to completing in a few minutes or even seconds. 

What kinds of things did we do to fix this? Mostly it had to do with changing the code of the query. It wasn't common that the stats were wrong. And even if they were, all we could do was recommend to the DBA team that the stats appeared to be off and hope they would fix it. 

Mostly it was rewrites of the query. 

One problem was the overuse of OUTER joins. Sure, they are needed sometimes, but the folks writing these queries tended to use them all the time. This did a few things: for one, it more or less forced a join order, and many times it didn't help the cardinality estimate either. I fixed many queries by changing OUTER joins to INNER joins. The query got the same result set (of course; if it didn't, that isn't a fix!) and could run orders of magnitude faster because it would get a better join order and/or a better cardinality estimate. 
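
As a tiny illustration of that kind of change (the table and column names here are made up):

  -- Before: the outer join is there out of habit
  SELECT o.order_id, c.customer_name
    FROM orders o
    LEFT OUTER JOIN customers c
      ON c.customer_id = o.customer_id;

  -- After: same result set when every order really has a customer,
  -- and the optimizer is free to pick a better join order
  SELECT o.order_id, c.customer_name
    FROM orders o
    INNER JOIN customers c
      ON c.customer_id = o.customer_id;

The two only return the same rows when every order really does have a matching customer, which is exactly what has to be verified before making this kind of change.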

Another problem we banged into was a bug with hybrid histograms. This would kick in mostly with very large sets of data. The histogram would often lead the optimizer to think it was getting back a very large set of rows, and very poor decisions were made because of that. The customer was still on version 12.1, and I'm pretty sure that bug has since been fixed. 

If you notice that a column with a hybrid histogram is giving very bad cardinality estimates, you might want to change that. How? Of course, you could drop the histogram (really you can't drop it; you just re-collect stats without creating the histogram on that column), but we couldn't do that. We would typically modify that column in the predicate so the optimizer couldn't use the histogram. Mostly we used NVL or COALESCE on the column. I prefer COALESCE myself because it will work on a column even if it's a primary key; NVL wouldn't. 
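
A sketch of that trick with made-up names (keeping in mind that wrapping a column in a function also blocks normal index use on that column):

  -- Before: the hybrid histogram on order_status drives the estimate
  SELECT COUNT(*)
    FROM customer_orders
   WHERE order_status = 'SHIPPED';

  -- After: the function on the column keeps the optimizer from using
  -- the histogram for this predicate; the result is unchanged
  SELECT COUNT(*)
    FROM customer_orders
   WHERE COALESCE(order_status, order_status) = 'SHIPPED';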

Sometimes it was expressing the predicate(s) in a different way. The simpler a predicate is the less likely a mistake will happen with the math. This typically requires really understanding the question being asked by the query. If you didn’t write the code, then you’ll likely need to chat with whoever did to find out what is going on. 

 Other times we had to resort to using the CARDINALITY hint to get the optimizer to do the right thing. I talked about this in my post "Hints in the cloud or not" this past March. You might want to check that post out too. It’s a quick read. 

The key with optimizing a query is getting that cardinality estimate to be in the right magnitude. Walk the plan "up and out" to find where it goes wrong. Stop there and fix that. Which table and predicates are driving the poor estimate? Really dig into what that bit of code is doing; can it be rewritten another way? Sometimes one tiny change to a huge query and all is good. Other times it's an iterative process: make a change, run it, make another and so on. 

Yes, it's a lot harder to do this than to add more memory or CPU or some other resource. But if you don't fix it and just "add hardware" to make it go away, all you really did was kick the can down the road, and it's going to come back.

Monday, July 27, 2020

The Rise of SQL Optimization in the Autonomous Age in the cloud or not

In the world of IT, the age of Autonomous computing is upon us.  I'm an Oracle guy by trade and Oracle is very much all about Autonomous.  Machine Learning is the latest buzz in the profession, and that has a lot of folks thinking that tuning and optimization will be completely in the hands of the autonomous database.

 

I tend to disagree.  I believe folks who can write optimal code will be just as needed as we embark on this autonomous path as they ever were, maybe more.  

 

The reason for this is rather simple.  Someone writing code has a purpose for the code, a result set they are interested in seeing.  The better they know the data they are working with and the better they know what they are looking for, the better they can write something to get the results.  This is why, many times, in my optimization of code I need to talk to the author about what it is they are trying to get, in order to find a more optimal way to get the results they want.

 

Of course, I and the autonomous database can do things like throw indexes at it or use hints/tricks to get the code to do this or that.  But maybe the code is just not the right code to begin with.  An outer join that doesn't need to be there, a subquery that really isn't returning anything useful for the final result set, or maybe it's just returning data that isn't even used. 

 

When a query is only tens of lines long, some of these things can sometimes be spotted with enough looking and testing.  But when the query is hundreds of lines long with an execution plan of thousands of lines, it's difficult for someone who doesn't know the "why" of the SQL to really optimize it. 

 

This is where the autonomous database will have some trouble optimizing code, if it can do anything at all.  Sure, it can solve the performance issue by throwing more horsepower at it (CPUs, memory, parallelism and the like).  But when the root cause of the performance issue is that the code is not doing the right thing to begin with, the autonomous database will not know that. 

 

The optimizer in the autonomous database is going to take a SQL statement as written and run it.  Yes, the optimizer has some limited capacity to do rewrites on the SQL, which can be rather good many times.  But will it know that the user doesn't actually need the data from these 6 tables in this complex subquery that is used four times, slightly differently each time?  No, and that is where the human SQL optimization specialist will still be needed.  She/he can talk to the user(s) of the query and find out the real intent and what they really want. 

 

How does code end up doing things it shouldn't do?  Shouldn't the coder know what they are asking and write code to get just that?  Why would someone write code that does extra work for no good reason?  

 

No one writes code like that on purpose I believe.  What happens more often than not is they don’t write the code from scratch.  They take code that sort-of does what they want and then typically add things to it to get it to do what they want.  This will tend to bloat the code in such a way that it ends up doing much more than it should and gets so complex that it’s hard for anyone to read and fully understand, including the optimizer. 

 

There is a saying about tuning that goes like this: "Tune the question, not the query."  And that is still going to be a human activity for the foreseeable future.  The optimizer doesn't know the real question being asked; it knows the representation of it as presented by the SQL. 

Monday, July 20, 2020

Baking a Cake in the Cloud or not

If you want an efficient system, don’t bake a cake at home, let the store bake it for you.

 

I'm a SQL performance guy for the most part.  When I look at SQL, a key thing is to understand the nature of the SQL and how it fits into the application.  This is sort of philosophical; I'm trying to understand the why of the SQL.  And one attribute I look for is how it is being used to process the result set of rows.  

 

And a killer is the ever-popular row-by-row operation.  This is hardly efficient no matter how you look at it.  This is a philosophical issue at the core: how the writer of the code sees the solution from a grand perspective.  

 

A good analogy is baking a cake, and getting the ingredients in particular.  The situation is this: you need to bake a cake and have none of the ingredients in your kitchen.  You have a list of the ingredients: flour, sugar, milk, eggs and so on.  How do you go about buying them?

 

Would you get in your car, drive to the store, get the flour, pay for it, get back in your car, drive home, put it on the counter, and check flour off your list?  Then get back in the car and do this cycle over again for sugar and every other ingredient?  I sure hope not.  I suspect you would take the list and go to the store once, get all the ingredients in your cart (trolley), pay for the whole lot once, and then go home and bake the cake.  

 

But even this isn't likely best: why bake the cake at home at all?  How about having the store do all that for you?  Many grocery stores (at least in the USA) can bake cakes right in the store, and they are pretty good at it since they have the staff and equipment to do it.  (And there are specialty shops where you can get even more elaborate baking creations, which you could make yourself, maybe on the fourth or fifth try.)

 

If all you want is a simple cake, you can just pick it up, pay for it and go.  That's akin to a simple query in the database.  If you want a custom cake, you call ahead and they have it ready for you when you arrive.  That's more like a complex query with lots of tables and maybe some fancy mathematical operations and such.  

 

Certainly, a lot of folks have gotten past the idea of getting "one row at a time" with their SQL.  But the second part, letting the database do most of the work, is where many folks have trouble.  This is mainly due to their philosophy and training on coding; they are very good Java, Python, C++ or similar programmers.  To them the database is just data, and they only want to get it from there into their application code.  There are queries bringing back millions of rows and then the application does more processing on them.  This is baking the cake at home; fun as baking a cake is, from a performance point of view it's not ideal. 

 

Using a stored PL/SQL packaged procedure to get the work done is best.  Push the business logic into the database inside a PL/SQL block.  The database is really good at this and can likely process those millions of rows faster than your application layer can.  This is buying the already baked and decorated cake from the store.  
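
As a small, hypothetical sketch of what "push the logic into the database" looks like (the table, columns and the 5% discount rule are all invented for illustration):

  CREATE OR REPLACE PROCEDURE apply_discounts (p_cutoff IN DATE) AS
  BEGIN
     -- One set-based statement replaces fetching millions of rows into the
     -- application layer and updating them one at a time
     UPDATE customer_orders
        SET discount_amt = order_total * 0.05
      WHERE order_date >= p_cutoff
        AND discount_amt IS NULL;

     COMMIT;
  END apply_discounts;
  /

A single call to the procedure from the application replaces the whole fetch, process, and update loop, and the database does the heavy lifting.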

 

As I said at the top, if you want an efficient system, don’t bake a cake at home, let the store bake it for you.

Thursday, June 11, 2020

Lock down in the cloud or not

Well, this sure has been a year.  And the odd thing is that for me, not much changed.  I do like to go to my local pub for a brew, which I can't right now.  Luckily, they have their own canning machine for the beers and I can still get great brews from them. 

On the professional side I've been doing a lot of online training.  Some of it is better than the rest, and it's great to see the innovations that have been made in this area over the years.  So far, I've only done recorded training, no interactive instructor giving the class.  

One that I just finished was quite good, and I thought it was great that this guy did the recordings in a room at his house.  It was done over several days as far as I can tell, it was really good training, and the casualness of it, I thought, was great.  

I've done some training on non-Oracle stuff (Ack!) and the more I did, the more I missed Oracle.  I suppose a big part of that is that I've been using Oracle for 1000 years, so it's much more "natural" for me.  

But when I hear one of the speakers in this other product say “We’ve been doing this a really long time, since 2006.” 

I'm like, what?  I don't call 14 years "a really long time".  Sounds like you're just getting into a rhythm to me.  But I guess in today's world where things change in a heartbeat, 14 years can seem like a "really long time".  

The old saying of the more things change, the more they stay the same is true too.  This Cloud thing is in a way a lot like when I started college and I "dialed in" to connect on a 1200 baud modem using a landline phone to have access to the computer.  Sure, it's way cooler and faster (and has color!) than what I worked on, but the basic idea is the same.  "It's someone else's computer" that I'm doing work on.

It's clearly the way to go for IT at least right now.  The cost reduction for hardware and such alone for most companies makes it worth it.  

Who knows what will happen in the future; I'm sure something else will come along and replace the Cloud, at least as we know it today.  What that will be I haven't a clue, but change is the only constant in the universe, so something will.  

And speaking of how much cooler and faster the new computer systems are, I'm just finding out about Oracle VirtualBox and that is really cool.  It is great for training and testing things.  And it looks like you can run some pretty big stuff with it.  I'd heard of it but never used it.  Now that I have, I think I'm hooked.  It will be great to have this on my laptop for testing and such. 

Well, this was a more rambling post than normal, but this seems to be a rambling kind of time and it is the name of this blog.  Some days I'm not sure what day it is, and I really can't believe it's June already.  I hope everyone is doing well and staying safe. 

Thursday, April 2, 2020

Working from Home

If you are one of the many folks that are suddenly trying to do your job from home, I’d like to take a moment to pass on some things I’ve learned. I’ve been working from home for years. It’s a different mindset. A couple of things I’ve found that might help you deal with this change:
Stay focused. You can get a lot more done at home than in the office. There aren't the office politics to deal with, which gives you more time to do your job. On the flip side, the job can take over your life. Set yourself hours for work and stick to them. 
Use technology. There are Skype, Zoom, Teams and other programs that let you connect with coworkers, and of course your phone. On a recent gig a coworker would typically call me in the morning and he and I would stay on the phone for an hour or more. Not always talking, but it was like he was next to me at the office. We could talk about what we were doing and have the typical chit-chat. 
Don't watch TV while working. Some folks can do this, but really it's not good. You'll get less done and feel rushed when you realize you didn't get such-and-such done. Some music for background noise is likely better and less distracting. 
Eat meals like you would when going to an office or job site. Have breakfast before you start, minimize snacking and have a dedicated lunch time. This keeps you on schedule and is likely better for your health than snacking all the time. 
Get up and move around now and then. Doing a simple domestic task once in a while to help break up the day a bit is good. Or just walk about and check for the mail. Don’t get into a big project, like cleaning the bathroom or painting a room. You wouldn’t do that when at the office or work site so don’t do that when you’re working. 
If you are the manager of folks who are now working from home, you may have the hardest adjustment. This is a complex and totally different way to manage folks. In a nutshell, you have to trust that they are doing their job. Lording over them, calling them every minute and checking their status on Skype or whatever is a great way to drive them and you crazy. 
Set up scheduled status calls and keep them short. Give them assignments to do and realistic deadlines. Help them with the technology; as best you can, get them the things they need. For example, having more than one monitor can really boost performance for folks doing a lot of computer-based work. Be open to helping them, be it actually helping them get something done or helping them find the resources they need to get something done. 
Your management skills will really be tested in ways you've never imagined. I've managed remote teams and it's a challenge. Hopefully your team will be in the same time zone; that helps a lot. 
Good luck!

Monday, March 30, 2020

Hints in the cloud or not

Hints aren’t hints; they are directives. 
 
Don't think of a hint as some way of encouraging the optimizer to do something; it is a demand to do something.  Generally, hints are only "ignored" when you misspell them or ask for something that can't be done, for example having the FULL hint and an INDEX hint on the same table.  There are also cases where the optimizer transforms a query in such a way that hints in effect disappear from the final query the optimizer really uses.  This isn't common in most shops, but it can happen.  


Hints are best used rarely and must be documented when used.

If you find yourself "always" using a hint, then you have a problem.  Two hints I've seen used "out of habit" are the NO_MERGE and MATERIALIZE hints.  Just because you have a sub-query doesn't mean you need NO_MERGE.  It might not be getting merged anyway, and maybe merging is a good idea if it is.  The same basic issues are true with MATERIALIZE on CTEs (Common Table Expressions or Sub-Query Factors).  The MATERIALIZE hint tells the optimizer to resolve the CTE as a separate query and turn the results into a temporary-table-like structure.  Often this is exactly what we want to have happen with a CTE.  But the same caveat applies: a CTE sometimes wouldn't be merged into the main query anyway, and even if it is, sometimes that is a good thing. 
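
For reference, when NO_MERGE really is warranted it goes in the outer query and names the view it applies to; a hypothetical example (table and column names made up):

  SELECT /*+ NO_MERGE(v) */
         c.customer_number, v.order_cnt
    FROM customers c
    JOIN (SELECT customer_id, COUNT(*) AS order_cnt
            FROM customer_orders
           GROUP BY customer_id) v
      ON v.customer_id = c.customer_number;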

If your testing validates that the hint helps the optimizer do the right thing and you can't get there any other way, document its use.  This documentation doesn't have to be elaborate and is really for you.  Always document the code as if you will be working on it again in two years with no recollection of having worked on it before.  Help yourself understand what you did today.  

Hints can be a fast fix for a bug.

I know this may be a shock, but there are bugs in the optimizer.  This is where the use of a hint might be a better solution than some elaborate coding trick.  Cardinality can be an issue here, and the CARDINALITY hint can help.  And CTEs, again, can be a problem.  If the cardinality estimate is way off for a CTE and that CTE is used many times in the main query, then the odds of the main query having a good plan are slim.  

If you can't rework the code of the CTE to get a better estimate, then a CARDINALITY hint can really save the day.  BUT this is a hint that could go bad over time.  Always put in a comment noting when it was added and that it will need to be checked in the future.  There may come a time when it isn't needed, or the value used needs to be updated.  Keep in mind that the number needs to be in the correct magnitude, not exactly right. 
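
A sketch of what that can look like (the hint is undocumented, the names are made up, and 50000 is the order-of-magnitude row count you expect the CTE to actually produce):

  WITH order_summary AS (
    SELECT customer_id, SUM(order_total) AS total_spend
      FROM customer_orders
     GROUP BY customer_id
  )
  SELECT /*+ CARDINALITY(os 50000) */   -- added 2020-12: CTE estimate was off, re-check after upgrade
         c.customer_number, os.total_spend
    FROM customers c
    JOIN order_summary os
      ON os.customer_id = c.customer_number;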


Hints should be viewed as a temporary fix and not the final solution. 

All this leads up to this: hints are temporary fixes to a problem.  Often we are tuning around bugs or poor coding.  If you don't have access to the author of the query, then significant rewriting of the code can be difficult or impossible.  Using a few well-placed hints can get the code to at least run in an acceptable amount of time.  This can get you through an "emergency", and hopefully there will come a time when the code can be rewritten or the bug fix is applied and the hints are no longer needed.