Some ANY Aggregate Transformations are Broken
Introduction
The ANY
aggregate is not something we can write directly in Transact SQL. It is an internal only feature used by the query optimizer and execution engine.
I am personally quite fond of the ANY
aggregate, so it was a bit disappointing to learn that it is broken in quite a fundamental way. The particular flavour of ‘broken’ I am referring to here is the wrong-results variety.
In this article, I take a look at two particular places where the ANY
aggregate commonly shows up, demonstrate the wrong results problem, and suggest workarounds where necessary.
For background on the ANY
aggregate, please see my previous post Undocumented Query Plans: The ANY Aggregate.
1. One Row Per Group Queries
This must be one of the most common day-to-day query requirements, with a very well-known solution. You probably write this sort of query every day, automatically following the pattern, without really thinking about it.
The idea is to number the input set of rows using the ROW_NUMBER
window function, partitioned by the grouping column or columns. That is wrapped in a Common Table Expression or derived table, and filtered down to rows where the computed row number is equal to one. Since the ROW_NUMBER
restarts at one for each group, this gives us the required one row per group.
There is no problem with that general pattern. The type of one row per group query that is subject to the ANY
aggregate problem is the one where we do not care which particular row is selected from each group.
In that case, it is not clear which column should be used in the mandatory ORDER BY
clause of the ROW_NUMBER
window function. After all, we explicitly do not care which row is selected. One common approach is to reuse the PARTITION BY
column(s) in the ORDER BY
clause. This is where the problem might occur.
Example
Let us look at an example using a toy data set:
CREATE TABLE #Data
(
c1 integer NULL,
c2 integer NULL,
c3 integer NULL
);
INSERT #Data
(c1, c2, c3)
VALUES
-- Group 1
(1, NULL, 1),
(1, 1, NULL),
(1, 111, 111),
-- Group 2
(2, NULL, 2),
(2, 2, NULL),
(2, 222, 222);
The requirement is to return any one complete row of data from each group, where group membership is defined by the value in column c1
.
Following the ROW_NUMBER
pattern, we might write a query like the following (notice the ORDER BY
clause of the ROW_NUMBER
window function matches the PARTITION BY
clause):
WITH
Numbered AS
(
SELECT
D.*,
rn = ROW_NUMBER() OVER (
PARTITION BY D.c1
ORDER BY D.c1)
FROM #Data AS D
)
SELECT
N.c1,
N.c2,
N.c3
FROM Numbered AS N
WHERE
N.rn = 1;
As presented, this query executes successfully, with correct results. The results are technically non-deterministic since SQL Server could validly return any one of the rows in each group. Nevertheless, if you run this query yourself, you are quite likely to see the same result I do:
The execution plan depends on the version of SQL Server used, and does not depend on database compatibility level.
On SQL Server 2014 and earlier, the plan is:
For SQL Server 2016 or later, you will see:
Both plans are safe, but for different reasons. The Distinct Sort plan contains an ANY
aggregate, but the Distinct Sort operator implementation does not manifest the bug.
The more complex SQL Server 2016+ plan does not use the ANY
aggregate at all. The Sort puts the rows into the order needed for the row numbering operation. The Segment operator sets a flag at the start of each new group. The Sequence Project computes the row number. Finally, the Filter operator passes on only those rows that have a computed row number of one.
The bug
To get incorrect results with this data set, we need to be using SQL Server 2014 or earlier, and the ANY
aggregates need to be implemented in a Stream Aggregate or Eager Hash Aggregate operator (Flow Distinct Hash Match Aggregate does not produce the bug).
One way to encourage the optimizer to choose a Stream Aggregate instead of Distinct Sort is to add a clustered index to provide ordering by column c1
:
CREATE CLUSTERED INDEX c ON #Data (c1);
After that change, the execution plan becomes:
The ANY
aggregates are visible in the Properties window when the Stream Aggregate operator is selected:
The result of the query is:
This is wrong. SQL Server has returned rows that do not exist in the source data. There are no source rows where c2 = 1
and c3 = 1
for example. As a reminder, the source data is:
The execution plan erroneously computes separate ANY
aggregates for the c2
and c3
columns, ignoring nulls. Each aggregate independently returns the first non-null value it encounters, giving a result where the values for c2
and c3
come from different source rows. This is not what the original SQL query specification requested.
The same wrong result can be produced with or without the clustered index by adding an OPTION (HASH GROUP)
hint to produce a plan with an Eager Hash Aggregate instead of a Stream Aggregate.
Conditions
This problem can only occur when multiple ANY
aggregates are present, and the aggregated data contains nulls. As noted, the issue only affects Stream Aggregate and Eager Hash Aggregate operators; Distinct Sort and Flow Distinct are not affected.
SQL Server 2016 onward makes an effort to avoid introducing multiple ANY
aggregates for the any one row per group row numbering query pattern when the source columns are nullable. When this happens, the execution plan will contain Segment, Sequence Project, and Filter operators instead of an aggregate. This plan shape is always safe, since no ANY
aggregates are used.
Reproducing the bug in SQL Server 2016+
The SQL Server optimizer is not perfect at detecting when a column originally constrained to be NOT NULL
might still produce a null intermediate value through data manipulations.
To reproduce this, we will start with a table where all columns are declared as NOT NULL
:
IF OBJECT_ID(N'tempdb..#Data', N'U') IS NOT NULL
BEGIN
DROP TABLE #Data;
END;
CREATE TABLE #Data
(
c1 integer NOT NULL,
c2 integer NOT NULL,
c3 integer NOT NULL
);
CREATE CLUSTERED INDEX c ON #Data (c1);
INSERT #Data
(c1, c2, c3)
VALUES
-- Group 1
(1, 1, 1),
(1, 2, 2),
(1, 3, 3),
-- Group 2
(2, 1, 1),
(2, 2, 2),
(2, 3, 3);
We can produce nulls from this data set in many ways, most of which the optimizer can successfully detect, and so avoid introducing ANY
aggregates during optimization.
One way to add nulls that happens to slip under the radar is shown below:
SELECT
D.c1,
OA1.c2,
OA2.c3
FROM #Data AS D
OUTER APPLY (SELECT D.c2 WHERE D.c2 <> 1) AS OA1
OUTER APPLY (SELECT D.c3 WHERE D.c3 <> 2) AS OA2;
That query produces the following output:
The next step is to use that query specification as the source data for the standard “any one row per group” query:
WITH
SneakyNulls AS
(
-- Introduce nulls the optimizer can't see
SELECT
D.c1,
OA1.c2,
OA2.c3
FROM #Data AS D
OUTER APPLY (SELECT D.c2 WHERE D.c2 <> 1) AS OA1
OUTER APPLY (SELECT D.c3 WHERE D.c3 <> 2) AS OA2
),
Numbered AS
(
SELECT
D.c1,
D.c2,
D.c3,
rn = ROW_NUMBER() OVER (
PARTITION BY D.c1
ORDER BY D.c1)
FROM SneakyNulls AS D
)
SELECT
N.c1,
N.c2,
N.c3
FROM Numbered AS N
WHERE
N.rn = 1;
On any version of SQL Server, that produces the following plan:
The Stream Aggregate contains multiple ANY
aggregates, and the result is wrong. Neither of the returned rows appear in the source data set:
Workaround
The only fully reliable workaround until this bug is fixed is to avoid the pattern where the ROW_NUMBER
has the same column in the ORDER BY
clause as is in the PARTITION BY
clause.
When we do not care which one row is selected from each group, it is unfortunate that an ORDER BY
clause is needed at all. One way to side-step the issue is to use a run time constant like ORDER BY @@SPID
in the window function.
2. Non-deterministic update
The issue with multiple ANY
aggregates on nullable inputs is not restricted to the any one row per group query pattern. The query optimizer can introduce an internal ANY
aggregate in a number of circumstances. One of those case is a non-deterministic update.
A non-deterministic update is where the statement does not guarantee that each target row will be updated at most once. In other words, there are multiple source rows for at least one target row. The documentation explicitly warns about this:
Use caution when specifying the FROM clause to provide the criteria for the update operation.
The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.
To handle a non-deterministic update, the optimizer groups the rows by a key (index or RID) and applies ANY
aggregates to the remaining columns. The basic idea there is to choose one row from multiple candidates, and use values from that row to perform the update. There are obvious parallels to the previous ROW_NUMBER
issue, so it is no surprise that it is quite easy to demonstrate an incorrect update.
Unlike the previous issue, SQL Server currently takes no special steps to avoid multiple ANY
aggregates on nullable columns when performing a non-deterministic update. The following therefore relates to all SQL Server versions, including SQL Server 2019 CTP 3.0.
Example
DECLARE @Target table
(
c1 integer PRIMARY KEY,
c2 integer NOT NULL,
c3 integer NOT NULL
);
DECLARE @Source table
(
c1 integer NULL,
c2 integer NULL,
c3 integer NULL,
INDEX c CLUSTERED (c1)
);
INSERT @Target
(c1, c2, c3)
VALUES
(1, 0, 0);
INSERT @Source
(c1, c2, c3)
VALUES
(1, 2, NULL),
(1, NULL, 3);
UPDATE T
SET T.c2 = S.c2,
T.c3 = S.c3
FROM @Target AS T
JOIN @Source AS S
ON S.c1 = T.c1;
SELECT * FROM @Target AS T;
Logically, this update should always produce an error: The target table does not allow nulls in any column. Whichever matching row is chosen from the source table, an attempt to update column c2
or c3
to null must occur.
Unfortunately, the update succeeds, and the final state of the target table is inconsistent with the supplied data:
I have reported this as a bug. The work around is to avoid writing non-deterministic UPDATE
statements, so ANY
aggregates are not needed to resolve the ambiguity.
As mentioned, SQL Server can introduce ANY
aggregates in more circumstances than the two examples given here. If this happens when the aggregated column contains nulls, there is the potential for wrong results.
Thanks for reading.
For an update on this, see my 2024 article Current State of the ANY Aggregate Transformation