Wednesday, January 13, 2021

Getting WITH it in the cloud or not


The WITH clause in Oracle is used to create CTEs, Common Table Expressions also called Sub-Query Factors.  Oracle as a company hasn’t promoted these nearly enough from my point of view, it’s been available since 9.2.  Which is quite a while ago and still many folks are unaware of it.  This little post is to give CTEs a of bit nudge into the light. 

 

So, what is this “CTE” thing?  It’s easiest to think of it as a temporary view that is useable for the duration of the query.  Once the query is finished, they disappear.  Even the syntax is like a view definition:

 

For a view:  CREATE VIEW my_favorite_view AS (select …

 

For a CTE: WITH my_favorite_cte as (select …

 

Generally, I recommend using the MATERLIZE hint for CTEs:

 

WITH my_favorite_cte AS (select /*+ MATERIALIZE */ … 

This hint forces the optimizer to not merge the CTE back into the main query.  Typically, you just did all this work to get it out of there and you don’t want it going back in.  Of course, testing is a great idea, test with and without the hint to be sure you are getting what you want. 

 

Once the CTE is created, use it like you would any table in your query.  You can select from it or join it into a query just as you would any table.  You can even select from an earlier defined CTE in a later one, but you can’t refer to one that hasn’t been defined yet.  No “forward referencing” here. 

 

You create these because they can be very powerful for query optimization. Amazingly powerful.  Here are some reasons to use them. 

 

The most notable cases are when a sub-query within a query is used many times.  Recently I posted about the use with correlated sub-queries in the select list.  Read the post here.  This can be very dramatic in reducing the time and resources used by a query.  

 

Another common case is a subquery that reused many times but  is “slightly” different each time.  For example, a join of a set of tables and the WHERE clause limits it each time on the same set of columns, just different values.  A CTE that gets the super set of the rows used throughout the query can be made, then select from that each time just getting the rows needed for a particular part of the query.  

 

The savings is that the query getting the super set runs only one time, and that set is stored in memory.  Then as it’s joined into each part, only get from it the rows needed.  This is especially good when the set of rows in the super set is much smaller than the total size of all the rows in the joined set of tables.   For example, there are 5 tables in the subquery, total rows of hundreds of millions, but the super set is only tens of thousands.   There is no set number of rows or ratio for this, test to find what works and what doesn’t. 

 

 

Queries using the UNION operator commonly have subqueries that are the same in each part.  Here is another great use of CTEs.  Take the common query and make it a CTE, now just refer to it in each of the UNION sections.

 

You might even use them just to make the query easier to read.  Breaking up the query in to “steps” with the CTEs can make debugging and modification easier later.   This is where maybe the MATERIALIZE hint isn’t needed.  The code might run just fine after the Optimizer smashes it all back together into one query.   

Some personal recommendation on using them.

 

Please give them intelligent names.  Calling them T1, T2 and so on is very painful for debugging.  To further help in reading the query, name them with _CTE as a suffix.  This makes it easier to see what are tables and what are CTEs in large statements.  

 

These next two are meant to keep memory use down.   Each CTE when MATERIALIZED, which more often is what you want, becomes a table like structure in your PGA.  Some guidelines to consider when creating CTEs: 

·      Keep the number of them down.  If you’re creating more than about 6-8 CTEs make sure they really do the trick.  

 

·      Also try to have CTEs that have a “few” number of rows in them.  About a million or less is great.

 

There isn’t really a hard limit of the number of CTEs and the number of rows in one.   Maybe there is a hard limit on the number of CTEs but it you find out what it is, you likely have other problems already.  The point with keeping the number and size down is to conserve memory.  

 

On the other hand, if you have 16 CTEs and the query goes from hours to seconds, then it’s memory well spent.   Same for size, a 500 million row CTE that cuts the run time in a like manner is worth the memory as well.  Over all you might end up uses less memory then when the query runs for hours when exceeding my guidelines purposed here.  Test, test and test again. 

 

It’s time to get WITH it and use CTEs in your queries.