Monday, 01 July 2019

Some ANY Aggregate Transformations are Broken

Introduction

The ANY aggregate is not something we can write directly in Transact SQL. It is an internal only feature used by the query optimizer and execution engine.

I am personally quite fond of the ANY aggregate, so it was a bit disappointing to learn that it is broken in quite a fundamental way. The particular flavour of ‘broken’ I am referring to here is the wrong-results variety.

In this article, I take a look at two particular places where the ANY aggregate commonly shows up, demonstrate the wrong results problem, and suggest workarounds where necessary.

For background on the ANY aggregate, please see my previous post Undocumented Query Plans: The ANY Aggregate.

1. One Row Per Group Queries

This must be one of the most common day-to-day query requirements, with a very well-known solution. You probably write this sort of query every day, automatically following the pattern, without really thinking about it.

The idea is to number the input set of rows using the ROW_NUMBER window function, partitioned by the grouping column or columns. That is wrapped in a Common Table Expression or derived table, and filtered down to rows where the computed row number is equal to one. Since the ROW_NUMBER restarts at one for each group, this gives us the required one row per group.

There is no problem with that general pattern. The type of one row per group query that is subject to the ANY aggregate problem is the one where we do not care which particular row is selected from each group.

In that case, it is not clear which column should be used in the mandatory ORDER BY clause of the ROW_NUMBER window function. After all, we explicitly do not care which row is selected. One common approach is to reuse the PARTITION BY column(s) in the ORDER BY clause. This is where the problem might occur.

Example

Let us look at an example using a toy data set:

CREATE TABLE #Data
(
    c1 integer NULL,
    c2 integer NULL,
    c3 integer NULL
);

INSERT #Data
    (c1, c2, c3)
VALUES
    -- Group 1
    (1, NULL, 1),
    (1, 1, NULL),
    (1, 111, 111),
    -- Group 2
    (2, NULL, 2),
    (2, 2, NULL),
    (2, 222, 222);

The requirement is to return any one complete row of data from each group, where group membership is defined by the value in column c1.

Source data

Following the ROW_NUMBER pattern, we might write a query like the following (notice the ORDER BY clause of the ROW_NUMBER window function matches the PARTITION BY clause):

WITH 
    Numbered AS 
    (
        SELECT 
            D.*, 
            rn = ROW_NUMBER() OVER (
                PARTITION BY D.c1
                ORDER BY D.c1) 
        FROM #Data AS D
    )
SELECT
    N.c1, 
    N.c2, 
    N.c3
FROM Numbered AS N
WHERE
    N.rn = 1;

As presented, this query executes successfully, with correct results. The results are technically non-deterministic since SQL Server could validly return any one of the rows in each group. Nevertheless, if you run this query yourself, you are quite likely to see the same result I do:

One possible result

The execution plan depends on the version of SQL Server used, and does not depend on database compatibility level.

On SQL Server 2014 and earlier, the plan is:

Distinct Sort ANY Aggregate Plan

For SQL Server 2016 or later, you will see:

Plan without ANY aggregate

Both plans are safe, but for different reasons. The Distinct Sort plan contains an ANY aggregate, but the Distinct Sort operator implementation does not manifest the bug.

The more complex SQL Server 2016+ plan does not use the ANY aggregate at all. The Sort puts the rows into the order needed for the row numbering operation. The Segment operator sets a flag at the start of each new group. The Sequence Project computes the row number. Finally, the Filter operator passes on only those rows that have a computed row number of one.

The bug

To get incorrect results with this data set, we need to be using SQL Server 2014 or earlier, and the ANY aggregates need to be implemented in a Stream Aggregate or Eager Hash Aggregate operator (Flow Distinct Hash Match Aggregate does not produce the bug).

One way to encourage the optimizer to choose a Stream Aggregate instead of Distinct Sort is to add a clustered index to provide ordering by column c1:

CREATE CLUSTERED INDEX c ON #Data (c1);

After that change, the execution plan becomes:

ANY Aggregate in Stream Aggregate

The ANY aggregates are visible in the Properties window when the Stream Aggregate operator is selected:

ANY Aggregate

The result of the query is:

Wrong results

This is wrong. SQL Server has returned rows that do not exist in the source data. There are no source rows where c2 = 1 and c3 = 1 for example. As a reminder, the source data is:

Source data

The execution plan erroneously computes separate ANY aggregates for the c2 and c3 columns, ignoring nulls. Each aggregate independently returns the first non-null value it encounters, giving a result where the values for c2 and c3 come from different source rows. This is not what the original SQL query specification requested.

The same wrong result can be produced with or without the clustered index by adding an OPTION (HASH GROUP) hint to produce a plan with an Eager Hash Aggregate instead of a Stream Aggregate.

Conditions

This problem can only occur when multiple ANY aggregates are present, and the aggregated data contains nulls. As noted, the issue only affects Stream Aggregate and Eager Hash Aggregate operators; Distinct Sort and Flow Distinct are not affected.

SQL Server 2016 onward makes an effort to avoid introducing multiple ANY aggregates for the any one row per group row numbering query pattern when the source columns are nullable. When this happens, the execution plan will contain Segment, Sequence Project, and Filter operators instead of an aggregate. This plan shape is always safe, since no ANY aggregates are used.

Reproducing the bug in SQL Server 2016+

The SQL Server optimizer is not perfect at detecting when a column originally constrained to be NOT NULL might still produce a null intermediate value through data manipulations.

To reproduce this, we will start with a table where all columns are declared as NOT NULL:

IF OBJECT_ID(N'tempdb..#Data', N'U') IS NOT NULL
BEGIN
    DROP TABLE #Data;
END;

CREATE TABLE #Data
(
    c1 integer NOT NULL,
    c2 integer NOT NULL,
    c3 integer NOT NULL
);

CREATE CLUSTERED INDEX c ON #Data (c1);

INSERT #Data
    (c1, c2, c3)
VALUES
    -- Group 1
    (1, 1, 1),
    (1, 2, 2),
    (1, 3, 3),
    -- Group 2
    (2, 1, 1),
    (2, 2, 2),
    (2, 3, 3);

We can produce nulls from this data set in many ways, most of which the optimizer can successfully detect, and so avoid introducing ANY aggregates during optimization.

One way to add nulls that happens to slip under the radar is shown below:

SELECT
    D.c1,
    OA1.c2,
    OA2.c3
FROM #Data AS D
OUTER APPLY (SELECT D.c2 WHERE D.c2 <> 1) AS OA1
OUTER APPLY (SELECT D.c3 WHERE D.c3 <> 2) AS OA2;

That query produces the following output:

Query result

The next step is to use that query specification as the source data for the standard “any one row per group” query:

WITH
    SneakyNulls AS 
    (
        -- Introduce nulls the optimizer can't see
        SELECT
            D.c1,
            OA1.c2,
            OA2.c3
        FROM #Data AS D
        OUTER APPLY (SELECT D.c2 WHERE D.c2 <> 1) AS OA1
        OUTER APPLY (SELECT D.c3 WHERE D.c3 <> 2) AS OA2
    ),
    Numbered AS 
    (
        SELECT
            D.c1,
            D.c2,
            D.c3,
            rn = ROW_NUMBER() OVER (
                PARTITION BY D.c1
                ORDER BY D.c1) 
        FROM SneakyNulls AS D
    )
SELECT
    N.c1, 
    N.c2, 
    N.c3
FROM Numbered AS N
WHERE
    N.rn = 1;

On any version of SQL Server, that produces the following plan:

ANY aggregate plan

The Stream Aggregate contains multiple ANY aggregates, and the result is wrong. Neither of the returned rows appear in the source data set:

Wrong results

db<>fiddle online demo

Workaround

The only fully reliable workaround until this bug is fixed is to avoid the pattern where the ROW_NUMBER has the same column in the ORDER BY clause as is in the PARTITION BY clause.

When we do not care which one row is selected from each group, it is unfortunate that an ORDER BY clause is needed at all. One way to side-step the issue is to use a run time constant like ORDER BY @@SPID in the window function.

2. Non-deterministic update

The issue with multiple ANY aggregates on nullable inputs is not restricted to the any one row per group query pattern. The query optimizer can introduce an internal ANY aggregate in a number of circumstances. One of those case is a non-deterministic update.

A non-deterministic update is where the statement does not guarantee that each target row will be updated at most once. In other words, there are multiple source rows for at least one target row. The documentation explicitly warns about this:

Use caution when specifying the FROM clause to provide the criteria for the update operation.
The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.

To handle a non-deterministic update, the optimizer groups the rows by a key (index or RID) and applies ANY aggregates to the remaining columns. The basic idea there is to choose one row from multiple candidates, and use values from that row to perform the update. There are obvious parallels to the previous ROW_NUMBER issue, so it is no surprise that it is quite easy to demonstrate an incorrect update.

Unlike the previous issue, SQL Server currently takes no special steps to avoid multiple ANY aggregates on nullable columns when performing a non-deterministic update. The following therefore relates to all SQL Server versions, including SQL Server 2019 CTP 3.0.

Example

DECLARE @Target table
(
    c1 integer PRIMARY KEY, 
    c2 integer NOT NULL, 
    c3 integer NOT NULL
);

DECLARE @Source table 
(
    c1 integer NULL, 
    c2 integer NULL, 
    c3 integer NULL, 
    
    INDEX c CLUSTERED (c1)
);

INSERT @Target 
    (c1, c2, c3) 
VALUES 
    (1, 0, 0);

INSERT @Source 
    (c1, c2, c3) 
VALUES 
    (1, 2, NULL),
    (1, NULL, 3);

UPDATE T
SET T.c2 = S.c2,
    T.c3 = S.c3
FROM @Target AS T
JOIN @Source AS S
    ON S.c1 = T.c1;

SELECT * FROM @Target AS T;

db<>fiddle online demo

Logically, this update should always produce an error: The target table does not allow nulls in any column. Whichever matching row is chosen from the source table, an attempt to update column c2 or c3 to null must occur.

Unfortunately, the update succeeds, and the final state of the target table is inconsistent with the supplied data:

I have reported this as a bug. The work around is to avoid writing non-deterministic UPDATE statements, so ANY aggregates are not needed to resolve the ambiguity.

As mentioned, SQL Server can introduce ANY aggregates in more circumstances than the two examples given here. If this happens when the aggregated column contains nulls, there is the potential for wrong results.

Thanks for reading.

For an update on this, see my 2024 article Current State of the ANY Aggregate Transformation