Enforcing Uniqueness for Performance
A little while back, I posted a short series on seeks and scans:
One of the things I highlighted in the middle post was the difference between a singleton seek and a range scan:
- A singleton equality seek always retrieves exactly one row, and is guaranteed to do so because a unique index exists to enforce it.
- A range scan seeks down the B-tree to a starting (or ending) point, and scans forward (or backward) from that point using the next or previous page pointers. Todayā€™s short post shows how much faster a singleton seek is, compared with a range scan, even when both return exactly the same number of records.
Test Table and Data
Pretty simple test rig today: A 5 million row table with a single bigint NOT NULL
column with all the values from 1 to 5,000,000 inclusive:
CREATE TABLE dbo.SeekTest
(
col1 bigint NOT NULL
);
GO
CREATE CLUSTERED INDEX cx
ON dbo.SeekTest (col1);
GO
-- 5 million rows numbered 1 to 5,000,000
INSERT dbo.SeekTest
WITH (TABLOCKX)
(col1)
SELECT TOP (5000000)
col1 =
ROW_NUMBER() OVER (
ORDER BY @@SPID)
FROM
master.sys.columns AS C
CROSS JOIN master.sys.columns AS c2
CROSS JOIN master.sys.columns AS c3
ORDER BY
col1;
Notice that the clustered index is not defined as unique. There will not be any uniquifiers (in case you think I am cheating) because SQL Server only adds them when necessary, and there are no duplicates in this example.
Test Query
The test query joins the table to itself, forcing a nested loops join so we get 5 million seek operations:
SELECT
COUNT_BIG(*)
FROM dbo.SeekTest AS ST
WITH (TABLOCK)
JOIN dbo.SeekTest AS ST2
WITH (TABLOCK)
ON ST2.col1 = ST.col1
OPTION (MAXDOP 1, LOOP JOIN, FORCE ORDER);
Here is the execution plan:
On my machine, typical results are:
Table 'SeekTest'.
Scan count 5000001, logical reads 15969038, physical reads 0, read-ahead reads 0
CPU time = 9454 ms, elapsed time = 9472 ms.
Notice the Scan Count.
We get one scan for the Clustered Index Scan iterator, and one scan for every seek operation. This is one way to see that we are getting range scans here ā€” the lack of a unique index means that SQL Server cannot guarantee that only one row will be returned from each seek, so a range scan is performed to pick up any additional matching rows.
Unique Index
Now letā€™s create a second table, where the only difference is that the index is declared as UNIQUE
:
CREATE TABLE dbo.SeekTestUnique
(
col1 bigint NOT NULL
);
GO
CREATE UNIQUE CLUSTERED INDEX cuq
ON dbo.SeekTestUnique (col1);
GO
-- 5 million rows numbered 1 to 5,000,000
INSERT dbo.SeekTestUnique WITH (TABLOCKX)
(col1)
SELECT TOP (5000000)
col1 =
ROW_NUMBER() OVER (
ORDER BY @@SPID)
FROM
master.sys.columns AS C
CROSS JOIN master.sys.columns AS c2
CROSS JOIN master.sys.columns AS c3
ORDER BY
col1;
As before, we join this table to itself using loops join (the TABLOCKs
are just to reduce locking overheads):
SELECT
COUNT_BIG(*)
FROM dbo.SeekTestUnique AS STU
WITH (TABLOCK)
JOIN dbo.SeekTestUnique AS STU2
WITH (TABLOCK)
ON STU2.col1 = STU.col1
OPTION (MAXDOP 1, LOOP JOIN, FORCE ORDER);
We get the same plan:
But with very different performance results:
Table SeekTestUnique
Scan count 1, logical reads 15323030, physical reads 0, read-ahead reads 0
CPU time = 6084 ms, elapsed time = 6096 ms.
Now there is only one scan, because the seeks are all singleton seeks.
Execution time has improved from 9,472ms to 6,096ms just by enforcing uniqueness.
There are lots of good reasons to be explicit about uniqueness where you can (and a few not to). More about that next time.