Tuesday, 17 September 2024

Why a Self-Join Requires Halloween Protection

This article was originally published on 𝕏.

I was asked recently why Halloween Protection was needed for data modification statements that include a self-join of the target table. This gives me a chance to explain, while also covering some interesting product bug history from the SQL Server 7 and 2000 days.

If you already know all there is to know about the Halloween Problem as it applies to SQL Server, you can skip the background section.

Background

Most descriptions of the Halloween Problem talk about rows moving around in an index. That’s probably the most common understanding of the issue.

A classic example is giving all employees in a particular salary range a 5% pay rise. With an index on the Salary column used to locate qualifying employees, the change might cause the row to move further down the index. If this happens, we will encounter the row again later in the range scan and apply a second salary increment. Good news for the affected employees, but that’s not what the SQL statement was supposed to do.

It’s an easily understood example, but the scope of the Halloween Problem is broader.

Logical phase separation

A SQL database change (insert, update, delete, or any combination of the three in a merge) is required by the SQL Standard to achieve the same end state as if all reads completed before the first write was issued.

Using the example above, there’s no way for an employee to receive multiple pay rises if we first read all qualifying rows, compute all the 5% increments, and only then start to apply changes to the database. This idea is called phase separation. It means the reading phase is fully completed before the writing phase begins.

That’s all very well, but for this to work we need to store a complete description of the changes somewhere before we can start to apply them. This storage operation has a cost associated with it, which can be significant when a large volume of data is involved.

The simplest implementation of phase separation in SQL Server involves an Eager Table Spool operator placed at the border of the reading and writing sides of an execution plan. The spool uses tempdb to hold the information needed to locate and change the affected rows.

Being smarter

SQL Server improves on this by recognising that existing plan operators on the reading side of the plan might already provide some protection through their normal operation. For example, a Sort fully separates its input from its output because the first sorted row can’t be produced until the entire input has been seen. Hash operators are also blocking on at least their build input.

By carefully assessing the protection qualities of all possible plan operators (including any different modes of operation they might have), SQL Server can often avoid introducing a full Eager Table Spool at the read/write border.

If the existing operators provide only partial protection, the optimizer might decide to include just the extra little bit of protection needed at the appropriate place in the plan. This might mean, for example, choosing a hash join instead of nested loops, adding a seemingly redundant sort, or introducing an eager spool in a place where relatively few rows are expected.

The optimizer also takes account of the maximum number of rows that might be affected by the operation. When only a single row can possibly be changed, no protection is needed. A single change can never affect subsequent operations, since there won’t be any. The guarantee of a single row might come in several guises, from an explicit TOP (1) or FETCH FIRST 1 ROW ONLY clause to an equality comparison on the key(s) of a unique index.

After its deliberations, the optimizer may still conclude that a full Eager Table Spool at the read/write boundary is the best option. As usual, it may explore many different alternatives before choosing the one that appears lowest cost according to its model.

That’s all marvellous of course, but it can be difficult to reliably assess when—and to what degree—protection is required in the first place. For instance, one might think that a plan that doesn’t move rows around in an index needs no protection from the Halloween Problem. One would be incorrect.

Self Joins

To illustrate the point, I’m going to use a table borrowed from an old Microsoft Knowledge Base article (more on that later).

IF OBJECT_ID(N'dbo.Test', 'U') IS NOT NULL
BEGIN
    DROP TABLE dbo.Test;
END;
GO
CREATE TABLE dbo.Test
(
    id integer NULL, 
    pid integer NULL, 
    fn varchar(256) NULL, 
    rn varchar(8) NULL
);
GO
CREATE UNIQUE CLUSTERED INDEX 
    idx_c_id 
    ON dbo.Test (id);

I’m not going to add any sample data because I’m only concerned with showing the execution plans SQL Server produces in a few different cases. The lack of data isn’t a problem because the cached plans need to be correct for whatever data might be added in future. SQL Server will include protection where necessary.

Update plan examples

Let’s take a look at the plans for updating the fn column to NULL in all rows, a change that doesn’t affect the keys of the clustered index. There are a few different ways to write this simple query. The following two forms are exactly equivalent. Some people just prefer to always update an alias because it’s safest when there are multiple tables involved.

-- Simplest
UPDATE dbo.Test
SET fn = NULL;

-- Exactly equivalent alias update
UPDATE T
SET fn = NULL
FROM dbo.Test AS T;

As expected, the plans are the same:

Unrestricted update plans

The situation changes if we join the table to itself:

UPDATE This
SET fn = NULL
FROM dbo.Test AS This
JOIN dbo.Test AS Parent
    ON Parent.id = This.pid;

The plan now includes an Eager Table Spool at the read/write border:

Self-join requires protection from the Halloween Problem

To show a different protection option, hint a merge join:

UPDATE This
SET fn = NULL
FROM dbo.Test AS This
JOIN dbo.Test AS Parent
    ON Parent.id = This.pid
OPTION (MERGE JOIN);

This produces a Sort operator on both join inputs instead of a Spool:

Protection provided by sorts instead of a spool

Finally, let’s change the statement so at most one row can be updated:

UPDATE This
SET fn = NULL
FROM dbo.Test AS This
JOIN dbo.Test AS Parent
    ON Parent.id = This.pid
WHERE
    This.id = 1;

The statement guarantees at most one row from the table instance aliased as This because the id column is a unique key and the comparison is equality. Similarly, the Parent alias instance also performs an equality comparison on its id column at the join.

These facts are enough for the optimizer to conclude that at most one row will be updated by this statement. It introduces a Top (1) operator in place of a spool:

Protection provided by a Top operator

Technically, the Top is unnecessary. Its presence is an implementation quirk.

Explanation

None of these plans modify the keys of an index, so why is protection needed? Rows aren’t moving around at all, so where’s the problem?

Remember, logical phase separation requires SQL Server to apply the same effects as if all reads had completed before the first write started. In particular, phase separation means it’s not possible to see the results of a prior data change made by the current statement. The physical execution plan SQL Server produces can’t allow that to happen either.

Self-joins break phase separation because a row we update now might be referred to later (via the self-join) by a subsequent row. If we use a value previously updated by the current statement, we might produce a result that logically could never happen with full read/write separation.

Another way to think about this is that modifying the keys of an index you’re reading from might allow you to encounter a row more than once while reading. A self-join has a similar effect because it allows you to look at any other row in the same table, including ones you’ve already modified.

It can be surprisingly complicated to assess which data changing operations require protection from the Halloween Problem, and how much protection particular plan operators provide due to their physical characteristics. The latest example of a problem in this area caused incorrect results in SQL Server 2014 to 2019 inclusive due to a parallel row mode Eager Table Spool on the probe side of a batch mode Hash Join. See my previous article for a full analysis.

An Old Bug

At some point while SQL Server 7 and 2000 were current, people discovered you could write update queries in a particular way to achieve a kind of recursive update.

For example, given a table that self-references parent rows (like an employee/manager relationship) it was possible to materialise the full path from a given record to the root using an UPDATE statement. In the employee/manager example, this would mean constructing the path through the organisational hierarchy from the employee to whomever was at the very top of the tree.

This was seen by some at the time as a clever and useful coding trick. The only problem was that it was a bit unreliable—sometimes it would work and sometimes it wouldn’t, depending on the exact shape of the execution plan SQL Server produced.

Thinking this was a bug of some kind, the behaviour was reported presumably with a view to getting reliable recursive updates. It was indeed a bug, but not in the direction those people were hoping.

The correct behaviour (because of phase separation) was not to perform a recursive update at all. This upset a few people at the time. Remember, recursive common table expressions hadn’t been added to the product yet, recursive scalar functions were limited to a depth of 32 and writing equivalent procedural T-SQL with WHILE loops often resulted in poor performance.

There were more than a few Halloween Problem bugs fixed in SQL Server 7 and 2000 (or even earlier). This particular one was documented under KB 285870 (archive link because Microsoft have deleted the original). It was written at a time when Microsoft provided useful details and sometimes even a reproduction of the issue addressed. Imagine that.

Demo

KB 285870 includes such a repro. I have reformatted the T-SQL slightly below and changed varchar(8) to varchar(11) for purely pedantic reasons:

IF OBJECT_ID(N'dbo.Test', 'U') IS NOT NULL
BEGIN
    DROP TABLE dbo.Test;
END;
GO
SET NOCOUNT ON;
GO
CREATE TABLE dbo.Test
(
    id integer NULL, 
    pid integer NULL, 
    fn varchar(256) NULL, 
    rn varchar(11) NULL
);
GO
INSERT INTO 
    dbo.Test (id, pid, fn, rn)
VALUES
    (0, NULL, 'root', 'root');
GO
DECLARE @c integer;
SET @c = 1;

WHILE @c < 10
BEGIN
    INSERT INTO 
        dbo.Test (id, pid, fn, rn)
    VALUES
        (@c, @c-1, NULL, CAST(@c as varchar(11)));

    SET @c = @c + 1;
END;
GO
CREATE UNIQUE CLUSTERED INDEX 
    idx_c_id 
    ON dbo.Test (id);

Hash join

The ‘recursive update’ code below is exactly as written in the KB:

update test
set fn = parent.fn + '/' + test.rn
from test(index=0) , test parent(index=0)
where test.pid = parent.id and test.fn is NULL --  and parent.fn <> ''
option(hash join,force order)

Before and after view of the table data in the old SQL Query Analyzer tool (no SSMS back then):

SQL Query Analyzer results

This result is incorrect because changes to later rows use the updated fn value from the prior row to construct the full path to the root. You can see how this could be a useful result, but it does depend on the order in which SQL Server updates the rows. It also breaks the phase separation requirement, of course.

For interest, the SQL Server 2000 RTM execution plan is:

No XML here. Only pictures and very limited tooltips.

The upper instance of the test table is the one without an alias in the update statement. It has complete phase separation courtesy of being on the blocking build input to the Hash Match Inner Join operator.

The lower instance of the test table is the one aliased as Parent. The probe input to the hash join is not blocking, so joined rows stream through as they are received from the lower clustered index scan. The optimizer has not required this to be an ordered scan, but rows are overwhelmingly likely to be produced in id order because the table is so small.

The Top operator in this plan has not been added to protect against the Halloween Problem. It is a ‘rowcount top’, used in old SQL Server versions to honour the session ROWCOUNT setting. It passes along each row it receives from the Hash Match Inner Join to the Compute Scalar and Clustered Index Update.

The problem is in the streaming (non-blocking) processing of the probe side rows. Because rows are read in id order and the join is on pid = id, the scan returns the post-update value from the prior row. This gives the apparently recursive result.

Nested loops join

Changing the query hints to OPTION (LOOP JOIN, FORCE ORDER) produces correct results:

Correct results based on pre-update values

Notice the Eager Table Spool on the lower Parent alias scan. Reading the full set of rows into temporary storage before the join produces its first row ensures that only pre-update values of the fn column are used by the Clustered Index Update. The result is not what the query writer was after, but it is correct.

Changing the query hint to OPTION (LOOP JOIN) alone produces incorrect results again:

Incorrect nested loop join results without FORCE ORDER

The plan looks the same, but this time the Parent alias is the upper table. The optimizer added an Eager Table Spool for protection but put it in the wrong place!

Merge join

With an OPTION (MERGE JOIN) or OPTION (MERGE JOIN, FORCE ORDER) query hint, correct results are produced due to blocking sorts on both inputs to the join:

Correct results with merge join and sorts on both inputs

The FORCE ORDER hint affects which alias of the table is uppermost but with blocking sorts on both inputs, this doesn’t matter at all.

We can still get incorrect results with a merge join, but we need to remove the (index = 0) table hints that force an unordered scan and use OPTION (MERGE JOIN) without the FORCE ORDER:

Incorrect merge join results

In this variation, the Parent alias is uppermost and not protected by a blocking sort. The update sees its own changes and recursive results are produced.

As mentioned earlier, if you wanted to write reliable parent-child path construction in SQL Server 2000, you’d need a recursive scalar function or procedural code like:

DECLARE 
    @id integer,
    @fn varchar(256);

SELECT
    @id = 9, -- The row to start with
    @fn = '';

WHILE @@ROWCOUNT > 0
BEGIN
    -- Find the next parent
    SELECT 
        @id = T.pid,
        @fn = '/' + T.rn + @fn
    FROM dbo.Test AS T
    WHERE 
        T.id = @id;
END;

-- Remove the leading /
SELECT path = STUFF(@fn, 1, 1, '');

This correctly produces the result root/1/2/3/4/5/6/7/8/9 for the row where id = 9.

Bug Summary

I hope you can see how confusing this all was at the time. SQL Server would produce different results for the same UPDATE statement depending on the execution plan chosen (hints are used to make the demo repeatable but in the real world, plan changes due to recompilations would often result in behavioural changes).

The underlying issue was an incorrect assessment of the Halloween Problem protection needed for self-joins, along with some issues around same-table alias matching in general. This particular bug was first fixed in Microsoft SQL Server 7.0 Service Pack 4 and SQL Server 2000 Service Pack 1, so you won’t be able to reproduce it on later versions.

This was a rich time for Halloween Problem bugs. Separate KB articles were written forself-inserts and deletes with a self-join, among many others.

Thanks for reading. I hope you enjoyed this part of the history of Halloween protection in SQL Server and now understand how self-joins can create problems in data modification statement.