Monday, November 2, 2020

Defaults in the cloud or not


 Use the defaults for nearly all the parameters.  

 

I was working on a gig recently and they had the rather interesting situation where queries were running faster in production than in the test environment.  This had started happening fairly recently, so the question was: why?

 

Upon investigation, the key finding was that the stats were different in the two systems.  I ran 10053 (optimizer trace) events on a sample query on both systems, and it was clear that the differences in stats made the two plans different.  The join orders were different, as was the use of indexes, etc.  The stats weren't a lot different between the two systems, but (generally) the tables in TEST were slightly bigger than in PROD.
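For reference, a 10053 trace can be turned on at the session level like this (the tracefile_identifier setting is optional but makes the file easy to find in the diag directory):

```sql
-- Tag the trace file name so it's easy to spot
ALTER SESSION SET tracefile_identifier = 'my_10053';

-- Turn on the optimizer trace
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

-- Run (hard parse) the query of interest here...

-- Turn the trace back off
ALTER SESSION SET EVENTS '10053 trace name context off';
```

Note that the trace is only written when the query is actually optimized, so if the statement is already in the shared pool you may need to change it slightly (a comment works) to force a hard parse.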

 

This was because they were testing new stuff in TEST.  It certainly makes sense to do this, of course, but since the stats on the two systems were different, the plans could be (and were) different.  And they were different enough that the test query I was provided ran in 6 seconds on PROD and 15+ minutes on TEST.  That's a pretty huge difference.

 

My first stab at changing the query was to rewrite it with a couple of CTEs (Common Table Expressions, using the WITH clause).   I isolated two subqueries that were hierarchical, with the good old START WITH/CONNECT BY PRIOR syntax.  I noted that these were being evaluated in different parts of the plan on the two systems, so I thought maybe isolating them as CTEs would help.   And it did.  The plans were still a little different between the systems, but now both ran in about 6 seconds.  Win!
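As a sketch of the idea (the table and column names here are made up for illustration, not the client's actual code), the rewrite pulls the hierarchical subquery out into a named CTE so it is evaluated as its own unit rather than woven into the larger plan:

```sql
-- Hierarchical subquery isolated as a CTE
WITH emp_tree AS (
  SELECT employee_id, department_id, LEVEL AS depth
    FROM employees
   START WITH manager_id IS NULL
  CONNECT BY PRIOR employee_id = manager_id
)
SELECT d.department_name, t.depth, COUNT(*) AS emp_count
  FROM emp_tree t
  JOIN departments d ON d.department_id = t.department_id
 GROUP BY d.department_name, t.depth;
```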

 

But that wasn’t really the root problem. I kept digging around to see what else could be a cause, since telling them to rewrite all their code with the “right” CTEs might be difficult at best.  And this is where something in the 10053 trace came in really handy.

 

Near the top of the 10053 trace file is a section on the parameters used by the optimizer.  This section alone can be very insightful as to what is going on.   The section starts with this:

 

***************************************

PARAMETERS USED BY THE OPTIMIZER

********************************

 

It has two parts, one starting with:

  *************************************

  PARAMETERS WITH ALTERED VALUES

  ******************************

 

The other with:

  *************************************

  PARAMETERS WITH DEFAULT VALUES

  ******************************

 

As the names suggest, they list parameters that have either changed values or are using defaults.  The list of parameters in the ALTERED VALUES should be quite short, the fewer the better.   They didn’t have a lot, which was very good.  But they did have two that really should be defaults today on both PROD and TEST:

 

optimizer_index_cost_adj

optimizer_index_caching             
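You don't need a 10053 trace just to spot non-default settings; a quick check against V$PARAMETER (given the appropriate privileges) shows the same thing:

```sql
-- List parameters that have been changed from their defaults
SELECT name, value
  FROM v$parameter
 WHERE isdefault = 'FALSE'
 ORDER BY name;
```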

 

These were common to change back in version 10, when the default cost model went from IO to CPU.  In 9 the default cost model was IO; in 10 the default became CPU, which is what is used today.  Back then, many shops found when first upgrading to 10 that indexes were not getting used appropriately.   Changing these parameters to non-default values helped.

 

But that time is over and the CPU model has been adjusted, so these parameters should be at their defaults.  The defaults are 100 for the cost adjust parameter and 0 for caching.  This means cost the plan as is: don't increase the cost (a setting over 100) and don't discount the cost (a setting under 100).  And for the caching parameter, it means don't assume any of the index is cached (in the buffer cache).
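If these have been set on your system and you've verified the defaults work, they can be removed from the spfile so the defaults take effect (assuming an spfile is in use; test this before doing it in production):

```sql
-- Remove the overrides; the defaults apply after the next restart
ALTER SYSTEM RESET optimizer_index_cost_adj SCOPE=SPFILE;
ALTER SYSTEM RESET optimizer_index_caching SCOPE=SPFILE;
```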

 

I tested the sample query with the following hints.  NOTE: The OPT_PARAM hint is great for testing; it should not be used in production code:

 

/*+ OPT_PARAM('optimizer_index_cost_adj' '100') OPT_PARAM('optimizer_index_caching' '0')   */
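In context, the hint goes right after the SELECT keyword; something like this (the query itself is hypothetical):

```sql
SELECT /*+ OPT_PARAM('optimizer_index_cost_adj' '100')
           OPT_PARAM('optimizer_index_caching' '0') */
       d.department_name, COUNT(*) AS emp_count
  FROM employees e
  JOIN departments d ON d.department_id = e.department_id
 GROUP BY d.department_name;
```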

 

Running the original code with these hints on PROD and TEST, the query ran in about 6 seconds on each.   Better win!  The reason this is better is that it now appears they can get the performance they expect without changing/rewriting the code.  More testing will need to be done by the client to verify this, but the testing done while I was on the gig looked quite good.

 

But why did this work?  I believe the differences in the stats were just enough to get exaggerated by the non-default parameters, producing different plans and the poor performance.  The plans were still very slightly different on the two systems, but not enough to cause a performance issue.

 

The key to this story is: use the defaults, Luke (substitute your name in for Luke).  Oracle does all of its internal testing with default settings for the parameters.  Once you venture into the land of non-defaults, you are in untested territory.  Clearly, a non-default setting may be needed to overcome a bug or a particular situation.  Once that situation or bug has been fixed, get back to the defaults.