Incorrect Results with Merge Join
Introduction
Every product has bugs, and SQL Server is no exception. Using product features in a slightly unusual way (or combining relatively new features together) is a great way to find them. The bug that is the subject of this article is probably reasonably rare in the wild, but it is not a classic edge case. I know of at least one consultant that has encountered it in a production system.
I will start with some relevant background on merge joins. If you are certain you already know everything there is to know about merge join—or just want to cut to the chase—feel free to skip down to the section titled, “The Bug”.
Note: The bug described in this article has been fixed for SQL Server 2014 RTM.
Merge Join
Merge join can be very efficient in the right circumstances. It requires that its inputs are sorted on the join keys and performs best in one-to-many mode (where at least of its inputs is unique on the join keys).
For moderately-sized one-to-many joins, serial merge join is not a bad choice at all—provided the input sorting requirements can be met without performing an explicit sort. Avoiding a sort is most commonly achieved by exploiting the ordering provided by an index.
Merge join can also take advantage of preserved sort order from an earlier, unavoidable sort. One neat thing about merge join is that it can stop processing input rows as soon as either input runs out of rows. Also, merge join does not care whether the input sort order is ascending or descending (though both inputs must be the same).
Example 1
The following example uses a standard Numbers table to illustrate most of the points above:
CREATE TABLE #T1
(
col1 integer NOT NULL
CONSTRAINT PK1
PRIMARY KEY CLUSTERED
(col1 DESC)
);
CREATE TABLE #T2
(
col1 integer NOT NULL
CONSTRAINT PK2
PRIMARY KEY CLUSTERED
(col1 DESC)
);
INSERT #T1 (col1)
SELECT n
FROM dbo.Numbers
WHERE n BETWEEN 10000 AND 19999;
INSERT #T2 (col1)
SELECT n
FROM dbo.Numbers
WHERE n BETWEEN 18000 AND 21999;
Backwards scans
Notice that the indexes enforcing the primary keys on those two tables are defined as descending. The query plan for the INSERT
has a number of interesting features:
Reading left to right, the Clustered Index Insert has the “DML Request Sort” property set. This means the operator requires rows in Clustered Index key order. The clustered index (enforcing the primary key in this case) is defined as DESC
, so rows with higher values are required to arrive first.
The clustered index on my Numbers table is ordered ASC
, so the query optimizer avoids an explicit sort by seeking to the highest match in the Numbers table (21,999) first, then scanning toward the lowest match (18,000) in reverse index order.
The “Plan Tree” view in Plan Explorer shows the reverse (backward) scan clearly:
Backward scanning reverses the natural order of the index. A backward scan of an ASC
index key returns rows in descending key order; a backward scan of a DESC
index key returns rows in ascending key order. The “scan direction” does not indicate returned key order by itself—you have to know whether the index is ASC
or DESC
to make that determination.
Example 2
Using these test tables and data (T1
has 10,000 rows numbered from 10,000 to 19,999 inclusive; T2
has 4,000 rows numbered from 18,000 to 21,999) the following query joins the two tables together and returns results in descending order of both keys:
SELECT
T1.col1,
T2.col1
FROM #T1 AS T1
JOIN #T2 AS T2
ON T2.col1 = T1.col1
ORDER BY
T1.col1 DESC,
T2.col1 DESC;
The query returns the correct matching 2,000 rows as you would expect. The post-execution plan is as follows:
The Merge Join is not running in many-to-many mode because the top input is unique on the join keys and the 2,000 row cardinality estimate is exactly correct.
The Clustered Index Scan of table T2
is ordered (though we have to wait a moment to discover whether that order is forward or backward) and the cardinality estimate of 4,000 rows is also exactly right. The Clustered Index Scan of table T1
is also ordered, but only 2,001 rows were read whereas 10,000 were estimated.
The plan tree view shows both Clustered Index Scans are ordered forward:
Recall that reading a DESC
index FORWARD
will produce rows in reverse key order. This is exactly what is required by the ORDER BY T1.col DESC, T2.col1 DESC
clause, so no explicit sort is necessary.
Early termination
Pseudo-code for one-to-many Merge Join (reproduced from Craig Freedman’s blog) is:
get first row R1 from input 1
get first row R2 from input 2
while not at the end of either input
begin
if R1 joins with R2
begin
return (R1, R2)
get next row R2 from input 2
end
else if R1 < R2
get next row R1 from input 1
else
get next row R2 from input 2
end
The descending order scan of T1
returns rows starting at 19,999 and working down towards 10,000. The descending order scan of T2
returns rows starting at 21,999 and working down towards 18,000. All 4,000 rows in T2
are eventually read, but the iterative merge process stops when key value 17,999 is read from T1
, because T2
runs out of rows.
Merge processing therefore completes without fully reading T1
. It reads rows from 19,999 down to 17,999 inclusive—a total of 2,001 rows as shown in the execution plan above. Feel free to re-run the test with ASC
indexes instead, also changing the ORDER BY
clause from DESC
to ASC
. The execution plan produced will be very similar and no sorts will be needed.
To summarize points that will be important in a moment: Merge Join requires join-key sorted inputs, but it does not matter whether the keys are sorted ascending or descending.
The Bug
To reproduce the bug, at least one of our tables needs to be partitioned. To keep the results manageable, this example will use just a small number of rows:
CREATE PARTITION FUNCTION PF (integer)
AS RANGE RIGHT
FOR VALUES (5, 10, 15);
CREATE PARTITION SCHEME PS
AS PARTITION PF
ALL TO ([PRIMARY]);
The first table contains two columns, and is partitioned on the primary key:
CREATE TABLE dbo.T1
(
T1ID integer IDENTITY (1,1) NOT NULL,
SomeID integer NOT NULL,
CONSTRAINT [PK dbo.T1 T1ID]
PRIMARY KEY CLUSTERED (T1ID)
ON PS (T1ID)
);
The second table is not partitioned. It contains a primary key and a column that will join to the first table:
CREATE TABLE dbo.T2
(
T2ID integer IDENTITY (1,1) NOT NULL,
T1ID integer NOT NULL,
CONSTRAINT [PK dbo.T2 T2ID]
PRIMARY KEY CLUSTERED (T2ID)
ON [PRIMARY]
);
Sample data
The first table has 14 rows all with the same value in the SomeID
column. SQL Server assigns the identity column values, numbered 1 to 14.
INSERT dbo.T1
(SomeID)
VALUES
(123), (123), (123),
(123), (123), (123),
(123), (123), (123),
(123), (123), (123),
(123), (123);
The second table is simply populated with the identity values from table one:
INSERT dbo.T2 (T1ID)
SELECT T1ID
FROM dbo.T1;
The data in the two tables looks like this:
Test query
The first query joins both tables applying a single WHERE
clause predicate (which happens to match all rows):
SELECT
T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
ON T2.T1ID = T1.T1ID
WHERE
T1.SomeID = 123;
The result contains all 14 rows, as expected:
Due to the small number of rows, the optimizer chooses a nested loops join plan for this query:
The results are the same (and still correct) if we force a hash or merge join:
SELECT
T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
ON T2.T1ID = T1.T1ID
WHERE
T1.SomeID = 123
OPTION (HASH JOIN);
SELECT
T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
ON T2.T1ID = T1.T1ID
WHERE
T1.SomeID = 123
OPTION (MERGE JOIN);
The Merge Join shown there is one-to-many with a sort on T1ID
required for table T2
.
Adding a descending index
All is well until one day an administrator adds a descending index on the SomeID
column of table 1:
CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC);
Our query continues to produce correct results when the optimizer chooses a Nested Loops or Hash Join, but it is a different story when a Merge Join is used.
The following coide still uses a query hint to force the Merge Join, but this is just a consequence of the low row counts in the example. The optimizer would naturally choose the same Merge Join plan with different table data.
SELECT
T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
ON T2.T1ID = T1.T1ID
WHERE
T1.SomeID = 123
OPTION (MERGE JOIN);
The execution plan is:
The optimizer has chosen to use the new index, but the query now produces only five rows of output:
What happened to the other 9 rows? For clarity, this result is incorrect. The data has not changed, so all 14 rows should be returned—as they are with a Nested Loops or Hash Join plan.
Cause and Explanation
The new nonclustered index on SomeID
is not declared as unique, so the clustered index key is added to all nonclustered index levels. SQL Server adds the T1ID
column (the clustered key) to the nonclustered index just as if we had created the index like this:
CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC, T1ID);
Notice the lack of a DESC
qualifier on the silently-added T1ID
key. Index keys are ASC
by default. This is not a problem in itself (though it contributes). The second thing that happens to our index automatically is that it is partitioned in the same way the base table is. So, the full index specification, if we were to write it out explicitly, would be:
CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC, T1ID ASC)
ON PS (T1ID);
An optimizer error
This is now quite a complex structure with keys in all sorts of different orders. It is complex enough for the query optimizer to make a mistake when reasoning about the sort order provided by the index.
To illustrate, consider the following simple query:
SELECT
T1ID,
PartitionID = $PARTITION.PF(T1ID)
FROM dbo.T1
WHERE
SomeID = 123
ORDER BY
T1ID ASC;
The extra column will just show us which partition the current row belongs in. Otherwise, it is just a simple query that returns T1ID
values in ascending order WHERE SomeID = 123
. Unfortunately, the results are not what is specified by the query:
Incorrect results
The query specifies that T1ID
values should be returned in ascending order, but that is not what we get.
We do get values in ascending order per partition, but the partitions themselves are returned in reverse order. If the partitions were returned in ascending order (and the T1ID
values remained sorted within each partition as shown) the result would be correct.
The query plan shows that the optimizer was confused by the leading DESC
key of the index and thought it needed to read the partitions in reverse order for correct results:
The partition seek starts at the right-most partition (4) and proceeds backwards to partition 1.
You might wonder if we could fix the issue by explicitly sorting on partition number ASC
in the ORDER BY
clause:
SELECT
T1ID,
PartitionID = $PARTITION.PF(T1ID)
FROM dbo.T1
WHERE
SomeID = 123
ORDER BY
PartitionID ASC, -- New!
T1ID ASC;
This query returns the same results (this is not a misprint or a copy/paste error):
The partition id is still in descending order (not ascending, as specified) and T1ID
is only sorted ascending within each partition. he optimizer really does seem to think that scanning the partitioned leading-descending-key index in a forward direction—but with partitions reversed—will result in the order specified by the query.
Final example
As a final example, consider:
SELECT
T1ID
FROM dbo.T1
WHERE
SomeID = 123
ORDER BY
T1ID DESC;
The results are:
Again, the T1ID
sort order within each partition is correctly descending, but the partitions are listed backwards (they go from 1 to 3 down the rows). If the partitions were returned in reverse order, the results would correctly be (14, 13, 12, 11, 10, 9, … 5, 4, 3, 2, 1).
Back to the Merge join
The cause of the incorrect results with the Merge Join query is now apparent:
SELECT
T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
ON T2.T1ID = T1.T1ID
WHERE
T1.SomeID = 123
OPTION (MERGE JOIN);
The Merge Join requires sorted inputs. The input from T2
is explicitly sorted by T1TD
so that is ok. The optimizer incorrectly reasons that the index on T1
can provide rows in T1ID
order. As we have seen, this is not the case. The Index Seek produces the same output as a query we have already seen:
SELECT
T1ID
FROM dbo.T1
WHERE
SomeID = 123
ORDER BY
T1ID ASC;
Only the first 5 rows are in T1ID
order. The next value (5) is certainly not in ascending order, and the Merge Join interprets this as end-of-stream rather than producing an error (I expected an assertion failure here). Anyway, the effect is that the Merge Join incorrectly finishes processing early.
As a reminder, the (incomplete) results are:
Conclusion
This is a serious bug—a straightforward index seek can return results that do not respect the ORDER BY
clause. Moreover, the optimizer’s internal reasoning seems to be broken for partitioned non-unique nonclustered indexes with a descending leading key.
Yes, this is a slightly unusual arrangement. But as we have seen, correct results can be suddenly replaced by incorrect results just because someone added a descending index. Remember the added index looked innocent enough: No explicit ASC/DESC
key mismatch, and no explicit partitioning.
The bug is not limited to merge joins. Potentially any query that involves a partitioned table and which relies on index sort order (explicit or implicit) could fall victim.
This bug exists in all versions of SQL Server from 2008 to 2014 CTP 1 inclusive. SQL Server 2005 used a different implementation model for partitioning (based on APPLY
) and does not suffer from this issue.
My Connect item for this issue is at Incorrect Results - Partitioned Nonclustered Index with Descending Key (web archive link).
Update
The fix for this issue is now available and documented in a KB 2892741.
Please note the fix requires a SQL Server code patch and trace flag 4199, which enables a range of other query processor changes. It is unusual for an incorrect results bug to be fixed under 4199. I asked for clarification on that and the response was:
Even though this problem involves incorrect results like other hotfixes involving the Query Processor we have only enabled this fix under trace flag 4199 for SQL Server 2008, 2008 R2, and 2012. However, this fix is “on” by default without the trace flag in SQL Server 2014 RTM.
Thanks for reading.