It’s rather incredible what a simple thing like DISTINCT can do the performance. Recently I was tasked with improving performance on a query. It was taking over an hour to complete. Running the plan and watching SQL Monitor it was easy to see that the plan was spending the vast major of time on a SORT UNIQUE step. It was sorting thru 1.4 Billion rows, so that is a lot of data to go thru. But the worse part was there was the same number of rows going into the SORT UNIQUE as coming out. Meaning, all the rows going in were already unique.
Looking at the columns in the select list, one column (an ID type field) was part of a primary key for a table was in the select list. The other column of the primary key was a numeric representation of a date, this second column of the primary key was not in the select list. The actual date was stored as a separate column. (Why they do this I have no idea, but there it is.) And this date column was in the select list along with the ID column. Because of this the ID and Date combination was unique. The optimizer didn’t and couldn’t know this because those two columns were not the primary key definition. Hence it had to do the UNIQUE step to find the unique values, which in this case was all entire set of rows. Why it didn’t do a HASH UNIQUE I couldn’t figure out, that likely would have been faster than the SORT UNIQUE.
But really the fastest way to do this was to not do it at all. It was completely unnecessary. Removing the unneeded DISTINCT chopped off an hour of processing time.
Once again proving the fastest way to do anything is to not do it at all.
What I hope you get out of this is, know your data. You will always know something the optimizer doesn’t know. Use that knowledge to write a better query and help the optimizer make better choices.