Friday, 29 July 2011

Enforcing Uniqueness for Performance

A little while back, I posted a short series on seeks and scans:

One of the things I highlighted in the middle post was the difference between a singleton seek and a range scan:

A singleton equality seek always retrieves exactly one row, and is guaranteed to do so because a unique index exists to enforce it.
A range scan seeks down the B-tree to a starting (or ending) point, and scans forward (or backward) from that point using the next or previous page pointers. Today’s short post shows how much faster a singleton seek is, compared with a range scan, even when both return exactly the same number of records.

Test Table and Data

Pretty simple test rig today: A 5 million row table with a single bigint NOT NULL column with all the values from 1 to 5,000,000 inclusive:

CREATE TABLE dbo.SeekTest
(
    col1 bigint NOT NULL
);
GO
CREATE CLUSTERED INDEX cx 
ON dbo.SeekTest (col1);
GO
-- 5 million rows numbered 1 to 5,000,000
INSERT dbo.SeekTest
    WITH (TABLOCKX)
    (col1)
SELECT TOP (5000000)
    col1 = 
        ROW_NUMBER() OVER (
            ORDER BY @@SPID)
FROM 
    master.sys.columns AS C
    CROSS JOIN master.sys.columns AS c2
    CROSS JOIN master.sys.columns AS c3
ORDER BY
    col1;

Notice that the clustered index is not defined as unique. There will not be any uniquifiers (in case you think I am cheating) because SQL Server only adds them when necessary, and there are no duplicates in this example.

Test Query

The test query joins the table to itself, forcing a nested loops join so we get 5 million seek operations:

SELECT 
    COUNT_BIG(*)
FROM dbo.SeekTest AS ST 
    WITH (TABLOCK)
JOIN dbo.SeekTest AS ST2
    WITH (TABLOCK)
    ON ST2.col1 = ST.col1
OPTION (MAXDOP 1, LOOP JOIN, FORCE ORDER);

Here is the execution plan:

Execution plan

On my machine, typical results are:

Table 'SeekTest'.
Scan count 5000001, logical reads 15969038, physical reads 0, read-ahead reads 0
CPU time = 9454 ms, elapsed time = 9472 ms.

Notice the Scan Count.

We get one scan for the Clustered Index Scan iterator, and one scan for every seek operation. This is one way to see that we are getting range scans here — the lack of a unique index means that SQL Server cannot guarantee that only one row will be returned from each seek, so a range scan is performed to pick up any additional matching rows.

Unique Index

Now let’s create a second table, where the only difference is that the index is declared as UNIQUE:

CREATE TABLE dbo.SeekTestUnique
(
    col1 bigint NOT NULL
);
GO
CREATE UNIQUE CLUSTERED INDEX cuq 
ON dbo.SeekTestUnique (col1);
GO
-- 5 million rows numbered 1 to 5,000,000
INSERT dbo.SeekTestUnique WITH (TABLOCKX)
    (col1)
SELECT TOP (5000000)
    col1 = 
        ROW_NUMBER() OVER (
            ORDER BY @@SPID)
FROM 
    master.sys.columns AS C
    CROSS JOIN master.sys.columns AS c2
    CROSS JOIN master.sys.columns AS c3
ORDER BY
    col1;

As before, we join this table to itself using loops join (the TABLOCKs are just to reduce locking overheads):

SELECT 
    COUNT_BIG(*)
FROM dbo.SeekTestUnique AS STU
    WITH (TABLOCK)
JOIN dbo.SeekTestUnique AS STU2
    WITH (TABLOCK)
    ON STU2.col1 = STU.col1
OPTION (MAXDOP 1, LOOP JOIN, FORCE ORDER);

We get the same plan:

Execution plan

But with very different performance results:

Table SeekTestUnique
Scan count 1, logical reads 15323030, physical reads 0, read-ahead reads 0
CPU time = 6084 ms, elapsed time = 6096 ms.

Now there is only one scan, because the seeks are all singleton seeks.

Execution time has improved from 9,472ms to 6,096ms just by enforcing uniqueness.

There are lots of good reasons to be explicit about uniqueness where you can (and a few not to). More about that next time.

Enforcing Uniqueness for Performance

Test Table and Data

Test Query

Unique Index

Further reading