Monday, December 4, 2017

Starts and subqueries, in the cloud or not.

When doing some work, the more efficiently we do it the better.  This is true in real life and with the queries we write in SQL.  Imagine you had a landscape supply company deliver a load of bricks you will use for a patio.  The problem is they are sitting in your driveway and the patio is in the back yard.  You could move them one at a time from the driveway to the back of the house, or you could get a wheelbarrow and move them several at a time.  Which way do you think you’ll move the bricks faster?

The same can happen with our queries, and the STARTS column in the run time stats of the plan can help us spot it.  You will see this stat when using the format option IOSTATS or ALLSTATS with DBMS_XPLAN.  Starting in 12.2 you can also see it in the STAT lines of the 10046 trace files.
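
For example, one way to see these stats is to run the statement with the gather_plan_statistics hint and then call DBMS_XPLAN.DISPLAY_CURSOR for it, which is what the full script at the end of this post does:

select * from table(dbms_xplan.display_cursor (format=>'allstats last'))
/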

You need information from a table within the query.  Certainly a natural way to do this is to write a scalar subquery in the select list to get it.  For example, I want a report of customers, the number of orders each has placed, and the average order amount.  I could write it like this (the tables I’m using here aren’t really customers and orders, but the pattern is the same):

select /*+ qb_name(mainblk) gather_plan_statistics */ a.username customer_name,
 (select /*+ qb_name(cntblk) */ nvl(count(*),0) from scalar_allobjects b where b.owner = a.username) cnt_orders,
 (select /*+ qb_name(avgblk) */ nvl(round(avg(object_id),2),0) from scalar_allobjects b where b.owner = a.username) avg_order_amt
from scalar_allusers a
order by a.username
/

When it runs, the run time stats look like this; notice the STARTS column in particular.

--------------------------------------------------------------------------------------------------
| Id  | Operation          | Name              | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                   |      1 |        |     48 |00:00:00.55 |     128K|
|   1 |  SORT AGGREGATE    |                   |     48 |      1 |     48 |00:00:00.27 |   64320 |
|*  2 |   TABLE ACCESS FULL| SCALAR_ALLOBJECTS |     48 |   2167 |  63383 |00:00:00.28 |   64320 |
|   3 |  SORT AGGREGATE    |                   |     48 |      1 |     48 |00:00:00.28 |   64320 |
|*  4 |   TABLE ACCESS FULL| SCALAR_ALLOBJECTS |     48 |   2167 |  63383 |00:00:00.30 |   64320 |
|   5 |  SORT ORDER BY     |                   |      1 |     48 |     48 |00:00:00.55 |     128K|
|   6 |   TABLE ACCESS FULL| SCALAR_ALLUSERS   |      1 |     48 |     48 |00:00:00.01 |       9 |
--------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - CNTBLK
   2 - CNTBLK  / B@CNTBLK
   3 - AVGBLK
   4 - AVGBLK  / B@AVGBLK
   5 - MAINBLK
   6 - MAINBLK / A@MAINBLK

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("B"."OWNER"=:B1)
   4 - filter("B"."OWNER"=:B1)

So it’s doing 48 full table scans for each subquery, on the same table.  That’s 96 full table scans on that table alone.  An index might help, but really the problem is that the subqueries are running way too many times.  How about a join of the two tables instead?  It has to be an outer join, because some customers have not placed any orders yet and they wouldn’t show up at all with an inner join.  Also we can’t do a COUNT(*), because that counts the occurrence of a row, and with an outer join we’d get a row even where there isn’t a match; counting a column from the outer-joined table instead (COUNT(B.OWNER)), which is null when there is no match, gives the correct zero.  So the query would look like this:

select /*+ qb_name(mainblk) gather_plan_statistics */ a.username customer_name,
 nvl(count(b.owner),0) cnt_orders,
 nvl(round(avg(b.object_id),2),0) avg_order_amt
from scalar_allusers a, scalar_allobjects b
where b.owner(+) = a.username
group by a.username
order by a.username
/

When this one runs, its stats look like this:

---------------------------------------------------------------------------------------------------
| Id  | Operation           | Name              | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                   |      1 |        |     48 |00:00:00.02 |    1349 |
|   1 |  SORT GROUP BY      |                   |      1 |     35 |     48 |00:00:00.02 |    1349 |
|*  2 |   HASH JOIN OUTER   |                   |      1 |  75842 |  63397 |00:00:00.01 |    1349 |
|   3 |    TABLE ACCESS FULL| SCALAR_ALLUSERS   |      1 |     48 |     48 |00:00:00.01 |       9 |
|   4 |    TABLE ACCESS FULL| SCALAR_ALLOBJECTS |      1 |  75829 |  75829 |00:00:00.01 |    1340 |
---------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - MAINBLK
   3 - MAINBLK / A@MAINBLK
   4 - MAINBLK / B@MAINBLK

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("B"."OWNER"="A"."USERNAME")

Now there is only one full table scan on each table; notice the STARTS column this time.  It also does significantly fewer LIOs (the Buffers column): the first query did over 128,000 and this one does 1,349.  And to boot, it’s over ten times faster, no surprise when it’s doing about 1% of the LIOs.  I ran this test many times and it was always over ten times faster.

The STARTS column is a great way to see if some part of the plan is running excessively.  If you see this, there are three basic attack patterns: turn the subquery into a join (as done here), use the WITH clause to turn it into a subquery factor and join that into the main query (see the sketch below), or put it into the FROM clause as an inline view which is joined to the other tables.
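
For example, here is a sketch of the WITH clause version on the same tables, aggregating once in the subquery factor and then outer joining the result (I did not run this variant as part of the timings above):

with ord as (
 select owner, count(*) cnt_orders, round(avg(object_id),2) avg_order_amt
 from scalar_allobjects
 group by owner
)
select a.username customer_name,
 nvl(o.cnt_orders,0) cnt_orders,
 nvl(o.avg_order_amt,0) avg_order_amt
from scalar_allusers a, ord o
where o.owner(+) = a.username
order by a.username
/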


All these have the same net effect: join the table into the query rather than selecting from it in a subquery that runs repeatedly.  More often than not this is going to do less work and hence will be faster.
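
And for completeness, the inline view variant is the same aggregation simply moved into the FROM clause, something like this:

select a.username customer_name,
 nvl(o.cnt_orders,0) cnt_orders,
 nvl(o.avg_order_amt,0) avg_order_amt
from scalar_allusers a,
 (select owner, count(*) cnt_orders, round(avg(object_id),2) avg_order_amt
  from scalar_allobjects
  group by owner) o
where o.owner(+) = a.username
order by a.username
/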


Here is the full script I used for this example:

set serveroutput off
column customer_name format a30
column cnt_orders    format 99,999,999
column avg_order_amt format 99,999,999.99

set termout off
select /*+ qb_name(mainblk) gather_plan_statistics */ a.username customer_name,
 (select /*+ qb_name(cntblk) */ nvl(count(*),0) from scalar_allobjects b where b.owner = a.username) cnt_orders,
 (select /*+ qb_name(avgblk) */ nvl(round(avg(object_id),2),0) from scalar_allobjects b where b.owner = a.username) avg_order_amt
from scalar_allusers a
order by a.username
/
set termout on

select * from table(dbms_xplan.display_cursor (format=>'iostats last alias'))
/

set termout off
select /*+ qb_name(mainblk) gather_plan_statistics */ a.username customer_name,
 nvl(count(b.owner),0) cnt_orders,
 nvl(round(avg(b.object_id),2),0) avg_order_amt
from scalar_allusers a, scalar_allobjects b
where b.owner(+) = a.username
group by a.username
order by a.username
/
set termout on

select * from table(dbms_xplan.display_cursor (format=>'iostats last alias'))
/

Creating the two tables:

create table scalar_allobjects as select * from all_objects;

alter table scalar_allobjects add (constraint allobjects_pkey primary key (object_id));

create table scalar_allusers as select * from all_users;

alter table scalar_allusers add (constraint allusers_pk primary key (user_id));

exec dbms_stats.gather_table_stats (user, 'scalar_allobjects', method_opt => 'for all columns size 1', -
     estimate_percent => 100, cascade => TRUE, no_invalidate => FALSE) ;
    
exec dbms_stats.gather_table_stats (user, 'scalar_allusers', method_opt => 'for all columns size 1', -
     estimate_percent => 100, cascade => TRUE, no_invalidate => FALSE) ;