It’s rather incredible what a simple thing like DISTINCT can do the performance. Recently I was tasked with improving performance on a query. It was taking over an hour to complete. Running the plan and watching SQL Monitor it was easy to see that the plan was spending the vast major of time on a SORT UNIQUE step. It was sorting thru 1.4 Billion rows, so that is a lot of data to go thru. But the worse part was there was the same number of rows going into the SORT UNIQUE as coming out. Meaning, all the rows going in were already unique.
Looking at the columns in the select list, one column (an ID
type field) was part of a primary key for a table was in the select list. The other column of the primary key was a numeric
representation of a date, this second column of the primary key was not in the
select list. The actual date was stored
as a separate column. (Why they do this
I have no idea, but there it is.) And
this date column was in the select list along with the ID column. Because of this the ID and Date combination was
unique. The optimizer didn’t and couldn’t
know this because those two columns were not the primary key definition. Hence it had to do the UNIQUE step to find
the unique values, which in this case was all entire set of rows. Why it didn’t do a HASH UNIQUE I couldn’t
figure out, that likely would have been faster than the SORT UNIQUE.
But really the fastest way to do this was to not do it at
all. It was completely unnecessary. Removing
the unneeded DISTINCT chopped off an hour of processing time.
Once again proving the fastest way to do anything is to not
do it at all.
What I hope you get out of this is, know your data. You will always know something the optimizer
doesn’t know. Use that knowledge to
write a better query and help the optimizer make better choices.