Tuesday, 20 August 2013

Incorrect Results with Merge Join

Introduction

Every product has bugs, and SQL Server is no exception. Using product features in a slightly unusual way (or combining relatively new features together) is a great way to find them. The bug that is the subject of this article is probably reasonably rare in the wild, but it is not a classic edge case. I know of at least one consultant that has encountered it in a production system.

I will start with some relevant background on merge joins. If you are certain you already know everything there is to know about merge join—or just want to cut to the chase—feel free to skip down to the section titled, “The Bug”.

Note: The bug described in this article has been fixed for SQL Server 2014 RTM.

Merge Join

Merge join can be very efficient in the right circumstances. It requires that its inputs are sorted on the join keys and performs best in one-to-many mode (where at least of its inputs is unique on the join keys).

For moderately-sized one-to-many joins, serial merge join is not a bad choice at all—provided the input sorting requirements can be met without performing an explicit sort. Avoiding a sort is most commonly achieved by exploiting the ordering provided by an index.

Merge join can also take advantage of preserved sort order from an earlier, unavoidable sort. One neat thing about merge join is that it can stop processing input rows as soon as either input runs out of rows. Also, merge join does not care whether the input sort order is ascending or descending (though both inputs must be the same).

Example 1

The following example uses a standard Numbers table to illustrate most of the points above:

CREATE TABLE #T1 
(
    col1 integer NOT NULL
    CONSTRAINT PK1 
        PRIMARY KEY CLUSTERED 
            (col1 DESC)
);

CREATE TABLE #T2 
(
    col1 integer NOT NULL
    CONSTRAINT PK2 
        PRIMARY KEY CLUSTERED 
            (col1 DESC)
);

INSERT #T1 (col1)
SELECT n 
FROM dbo.Numbers 
WHERE n BETWEEN 10000 AND 19999;

INSERT #T2 (col1)
SELECT n 
FROM dbo.Numbers 
WHERE n BETWEEN 18000 AND 21999;

Backwards scans

Notice that the indexes enforcing the primary keys on those two tables are defined as descending. The query plan for the INSERT has a number of interesting features:

INSERT execution plan

Reading left to right, the Clustered Index Insert has the “DML Request Sort” property set. This means the operator requires rows in Clustered Index key order. The clustered index (enforcing the primary key in this case) is defined as DESC, so rows with higher values are required to arrive first.

The clustered index on my Numbers table is ordered ASC, so the query optimizer avoids an explicit sort by seeking to the highest match in the Numbers table (21,999) first, then scanning toward the lowest match (18,000) in reverse index order.

The “Plan Tree” view in Plan Explorer shows the reverse (backward) scan clearly:

INSERT Plan Tree

Backward scanning reverses the natural order of the index. A backward scan of an ASC index key returns rows in descending key order; a backward scan of a DESC index key returns rows in ascending key order. The “scan direction” does not indicate returned key order by itself—you have to know whether the index is ASC or DESC to make that determination.

Example 2

Using these test tables and data (T1 has 10,000 rows numbered from 10,000 to 19,999 inclusive; T2 has 4,000 rows numbered from 18,000 to 21,999) the following query joins the two tables together and returns results in descending order of both keys:

SELECT
    T1.col1,
    T2.col1
FROM #T1 AS T1 
JOIN #T2 AS T2 
    ON T2.col1 = T1.col1 
ORDER BY 
    T1.col1 DESC, 
    T2.col1 DESC;

The query returns the correct matching 2,000 rows as you would expect. The post-execution plan is as follows:

Basic Query Analysis

The Merge Join is not running in many-to-many mode because the top input is unique on the join keys and the 2,000 row cardinality estimate is exactly correct.

The Clustered Index Scan of table T2 is ordered (though we have to wait a moment to discover whether that order is forward or backward) and the cardinality estimate of 4,000 rows is also exactly right. The Clustered Index Scan of table T1 is also ordered, but only 2,001 rows were read whereas 10,000 were estimated.

The plan tree view shows both Clustered Index Scans are ordered forward:

Plan Tree View

Recall that reading a DESC index FORWARD will produce rows in reverse key order. This is exactly what is required by the ORDER BY T1.col DESC, T2.col1 DESC clause, so no explicit sort is necessary.

Early termination

Pseudo-code for one-to-many Merge Join (reproduced from Craig Freedman’s blog) is:

get first row R1 from input 1
get first row R2 from input 2
while not at the end of either input
begin
    if R1 joins with R2
    begin
        return (R1, R2)
        get next row R2 from input 2
    end
    else if R1 < R2
        get next row R1 from input 1
    else
        get next row R2 from input 2
end

The descending order scan of T1 returns rows starting at 19,999 and working down towards 10,000. The descending order scan of T2 returns rows starting at 21,999 and working down towards 18,000. All 4,000 rows in T2 are eventually read, but the iterative merge process stops when key value 17,999 is read from T1, because T2 runs out of rows.

Merge processing therefore completes without fully reading T1. It reads rows from 19,999 down to 17,999 inclusive—a total of 2,001 rows as shown in the execution plan above. Feel free to re-run the test with ASC indexes instead, also changing the ORDER BY clause from DESC to ASC. The execution plan produced will be very similar and no sorts will be needed.

To summarize points that will be important in a moment: Merge Join requires join-key sorted inputs, but it does not matter whether the keys are sorted ascending or descending.

The Bug

To reproduce the bug, at least one of our tables needs to be partitioned. To keep the results manageable, this example will use just a small number of rows:

CREATE PARTITION FUNCTION PF (integer)
AS RANGE RIGHT
FOR VALUES (5, 10, 15);

CREATE PARTITION SCHEME PS
AS PARTITION PF
ALL TO ([PRIMARY]);

The first table contains two columns, and is partitioned on the primary key:

CREATE TABLE dbo.T1
(
    T1ID integer IDENTITY (1,1) NOT NULL,
    SomeID integer NOT NULL,

    CONSTRAINT [PK dbo.T1 T1ID]
        PRIMARY KEY CLUSTERED (T1ID)
        ON PS (T1ID)
);

The second table is not partitioned. It contains a primary key and a column that will join to the first table:

CREATE TABLE dbo.T2
(
    T2ID integer IDENTITY (1,1) NOT NULL,
    T1ID integer NOT NULL,

    CONSTRAINT [PK dbo.T2 T2ID]
        PRIMARY KEY CLUSTERED (T2ID)
        ON [PRIMARY]
);

Sample data

The first table has 14 rows all with the same value in the SomeID column. SQL Server assigns the identity column values, numbered 1 to 14.

INSERT dbo.T1
    (SomeID)
VALUES
    (123), (123), (123),
    (123), (123), (123),
    (123), (123), (123),
    (123), (123), (123),
    (123), (123);

The second table is simply populated with the identity values from table one:

INSERT dbo.T2 (T1ID)
SELECT T1ID
FROM dbo.T1;

The data in the two tables looks like this:

Sample Data

Test query

The first query joins both tables applying a single WHERE clause predicate (which happens to match all rows):

SELECT
    T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE
    T1.SomeID = 123;

The result contains all 14 rows, as expected:

Due to the small number of rows, the optimizer chooses a nested loops join plan for this query:

Nested Loops Plan

The results are the same (and still correct) if we force a hash or merge join:

SELECT
    T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE
    T1.SomeID = 123
OPTION (HASH JOIN);

SELECT
    T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE
    T1.SomeID = 123
OPTION (MERGE JOIN);

Hash and Merge Plans

The Merge Join shown there is one-to-many with a sort on T1ID required for table T2.

Adding a descending index

All is well until one day an administrator adds a descending index on the SomeID column of table 1:

CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC);

Our query continues to produce correct results when the optimizer chooses a Nested Loops or Hash Join, but it is a different story when a Merge Join is used.

The following coide still uses a query hint to force the Merge Join, but this is just a consequence of the low row counts in the example. The optimizer would naturally choose the same Merge Join plan with different table data.

SELECT
    T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE
    T1.SomeID = 123
OPTION (MERGE JOIN);

The execution plan is:

Merge Join Incorrect Results

The optimizer has chosen to use the new index, but the query now produces only five rows of output:

Incorrect Results Output

What happened to the other 9 rows? For clarity, this result is incorrect. The data has not changed, so all 14 rows should be returned—as they are with a Nested Loops or Hash Join plan.

Cause and Explanation

The new nonclustered index on SomeID is not declared as unique, so the clustered index key is added to all nonclustered index levels. SQL Server adds the T1ID column (the clustered key) to the nonclustered index just as if we had created the index like this:

CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC, T1ID);

Notice the lack of a DESC qualifier on the silently-added T1ID key. Index keys are ASC by default. This is not a problem in itself (though it contributes). The second thing that happens to our index automatically is that it is partitioned in the same way the base table is. So, the full index specification, if we were to write it out explicitly, would be:

CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC, T1ID ASC)
ON PS (T1ID);

An optimizer error

This is now quite a complex structure with keys in all sorts of different orders. It is complex enough for the query optimizer to make a mistake when reasoning about the sort order provided by the index.

To illustrate, consider the following simple query:

SELECT 
    T1ID,
    PartitionID = $PARTITION.PF(T1ID)
FROM dbo.T1
WHERE
    SomeID = 123
ORDER BY
    T1ID ASC;

The extra column will just show us which partition the current row belongs in. Otherwise, it is just a simple query that returns T1ID values in ascending order WHERE SomeID = 123. Unfortunately, the results are not what is specified by the query:

Incorrect Results

Incorrect results

The query specifies that T1ID values should be returned in ascending order, but that is not what we get.

We do get values in ascending order per partition, but the partitions themselves are returned in reverse order. If the partitions were returned in ascending order (and the T1ID values remained sorted within each partition as shown) the result would be correct.

The query plan shows that the optimizer was confused by the leading DESC key of the index and thought it needed to read the partitions in reverse order for correct results:

Reversed Partitions

The partition seek starts at the right-most partition (4) and proceeds backwards to partition 1.

You might wonder if we could fix the issue by explicitly sorting on partition number ASC in the ORDER BY clause:

SELECT 
    T1ID,
    PartitionID = $PARTITION.PF(T1ID)
FROM dbo.T1
WHERE
    SomeID = 123
ORDER BY
    PartitionID ASC, -- New!
    T1ID ASC;

This query returns the same results (this is not a misprint or a copy/paste error):

Incorrect Results

The partition id is still in descending order (not ascending, as specified) and T1ID is only sorted ascending within each partition. he optimizer really does seem to think that scanning the partitioned leading-descending-key index in a forward direction—but with partitions reversed—will result in the order specified by the query.

Final example

As a final example, consider:

SELECT 
    T1ID
FROM dbo.T1
WHERE
    SomeID = 123
ORDER BY
    T1ID DESC;

The results are:

Incorrect Results

Again, the T1ID sort order within each partition is correctly descending, but the partitions are listed backwards (they go from 1 to 3 down the rows). If the partitions were returned in reverse order, the results would correctly be (14, 13, 12, 11, 10, 9, … 5, 4, 3, 2, 1).

Back to the Merge join

The cause of the incorrect results with the Merge Join query is now apparent:

SELECT
    T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE
    T1.SomeID = 123
OPTION (MERGE JOIN);

Merge Revisited

The Merge Join requires sorted inputs. The input from T2 is explicitly sorted by T1TD so that is ok. The optimizer incorrectly reasons that the index on T1 can provide rows in T1ID order. As we have seen, this is not the case. The Index Seek produces the same output as a query we have already seen:

SELECT 
    T1ID
FROM dbo.T1
WHERE
    SomeID = 123
ORDER BY
    T1ID ASC;

Query Results

Only the first 5 rows are in T1ID order. The next value (5) is certainly not in ascending order, and the Merge Join interprets this as end-of-stream rather than producing an error (I expected an assertion failure here). Anyway, the effect is that the Merge Join incorrectly finishes processing early.

As a reminder, the (incomplete) results are:

Incorrect Query Results

Conclusion

This is a serious bug—a straightforward index seek can return results that do not respect the ORDER BY clause. Moreover, the optimizer’s internal reasoning seems to be broken for partitioned non-unique nonclustered indexes with a descending leading key.

Yes, this is a slightly unusual arrangement. But as we have seen, correct results can be suddenly replaced by incorrect results just because someone added a descending index. Remember the added index looked innocent enough: No explicit ASC/DESC key mismatch, and no explicit partitioning.

The bug is not limited to merge joins. Potentially any query that involves a partitioned table and which relies on index sort order (explicit or implicit) could fall victim.

This bug exists in all versions of SQL Server from 2008 to 2014 CTP 1 inclusive. SQL Server 2005 used a different implementation model for partitioning (based on APPLY) and does not suffer from this issue.

My Connect item for this issue is at Incorrect Results - Partitioned Nonclustered Index with Descending Key (web archive link).

Update

The fix for this issue is now available and documented in a KB 2892741.

Please note the fix requires a SQL Server code patch and trace flag 4199, which enables a range of other query processor changes. It is unusual for an incorrect results bug to be fixed under 4199. I asked for clarification on that and the response was:

Even though this problem involves incorrect results like other hotfixes involving the Query Processor we have only enabled this fix under trace flag 4199 for SQL Server 2008, 2008 R2, and 2012. However, this fix is “on” by default without the trace flag in SQL Server 2014 RTM.

Thanks for reading.