Monday, 24 February 2025

How a Lazy Index Spool Works

Introduction

I had a question from a reader of my article, Nested Loops Joins and Performance Spools, about the section describing the Apply Lazy Index Spool. The question was about how the spool behaves when a cached result returns no rows.

Quoting from the linked article:

The Lazy Index Spool operator is only available with apply.

The index spool maintains a worktable that is not truncated when outer reference values change. Instead, new data is added to the existing cache, indexed by the outer reference values. A lazy index spool differs from a lazy table spool in that it can replay results from any prior loop iteration, not just the most recent one.

This mechanism is easy to understand when a prior iteration cached a result that the spool can replay (a rewind). Presumably (the questioner says), when the cached result is NULL, that is cached as well.

The question is: What happens when the prior iteration returned no rows at all? Does the spool somehow cache the empty result (being distinct from a NULL), or does it execute the spool’s input subtree again (effectively, a rebind)?

This is an insightful question and deserves a proper response.

Conditions

I gave the necessary conditions to produce an Apply Lazy Index Spool in the prior article:

The optimizer requires:

Some duplicate join values on the outer input.

An equality join predicate (or a logical equivalent the optimizer understands, such as x <= y AND x >= y).

A guarantee that the outer references are unique below the proposed lazy index spool.

An interesting restatement of these requirements is that the spool can cache a maximum of one result row for each value of the correlated apply parameter(s). In other words, the correlated apply parameters form a key for the spool’s cache.

Demo

To investigate how the spool functions internally, we ideally need to see the contents of the spool’s worktable.

There are a couple of ways we could achieve this, but the most convenient is to use an undocumented trace flag. Apologies to anyone who can’t set trace flags, but that’s what you get for invoking the awesome power of the cloud.

I’m going to use one of my favourite undocumented trace flags—one I don’t think I have revealed before. It is trace flag 600, which prints a row image for inserts to a b-tree structure (not heaps). The spool’s worktable is organized as a clustered index, so this will work perfectly here.

The demo is a slight variation on the one used in the prior article:

CREATE TABLE #S 
(
    c1 char(4) NOT NULL
);
GO
INSERT #S (c1)
VALUES
    -- Some duplicates
    ('aaaa'), 
    ('bbbb'), 
    ('cccc'), 
    ('dddd'), 
    ('aaaa'),
    ('bbbb');
GO
DBCC TRACEON (600, 3604) 
    WITH NO_INFOMSGS;

SELECT
    T1.c1, 
    CA.x
FROM #S AS T1
CROSS APPLY 
(
    SELECT
        x = MAX(U.x)
    FROM 
    (
        SELECT T1.c1
        UNION
        SELECT T1.c1
    ) AS U (x)
    WHERE
        U.x > 'd'
    GROUP BY 
        ()
) AS CA;

DBCC TRACEOFF (600, 3604)
    WITH NO_INFOMSGS;

DROP TABLE #S;

The query itself is just complex enough to prevent the optimizer from transforming it to a simpler form that does not require a spool, while meeting all the other requirements.

One key aspect is the GROUP BY () clause. This ensures the MAX aggregate behaves as a vector aggregate, rather than as a scalar aggregate. If you need a refresher on that, see my article Fun with Scalar and Vector Aggregates.

In SQL Server, a scalar aggregate always returns a non-empty result for an empty input (NULL for the MAX case); a vector aggregate returns no rows at all, which is what we want here.

Overall, the effect is that the APPLY above returns one row containing ‘dddd’ when the correlated parameter value is ‘dddd’. It returns an empty result set for all other values.

The results are:

Demo results

The execution plan is:

Lazy index spool plan Lazy index spool plan

Plan analysis

Unfortunately, the Actual Rebinds and Actual Rewinds properties of a Lazy Index Spool are not reliable due to implementation details.

The spool in the plan above reports six rebinds and zero rewinds. This reflects the fact that the correlated parameter value changed six times and matched the immediately prior value zero times:

Lazy Index Spool properties Lazy Index Spool properties

This outcome corresponds to the Table Scan producing rows in the order they were inserted:

‘aaaa’ — first value = rebind.
‘bbbb’ — not the same = rebind.
‘cccc’ — not the same = rebind.
‘dddd’ — not the same = rebind.
‘aaaa’ — matches a previous value, but not the prior one, reported as a rebind.
‘bbbb’ — matches a previous value, but not the prior one, reported as a rebind.

These rebind and rewind values would be correct for a Table Spool, which only caches the prior result and truncates its worktable on a rebind. The numbers are not correct for a Lazy Index Spool.

Actual behaviour

We can see the Lazy Index Spool behaved differently by looking at the Number of Executions reported by operators in its input tree, for example the Stream Aggregate.

This reports four separate executions, which makes sense because there are four unique values in the driving set:

Stream Aggregate properties Stream Aggregate properties

We can infer from this that the Lazy Index Spool did indeed cache results for ‘aaaa’, ‘bbbb’, ‘cccc’, and ‘dddd’ and reuse those empty results for the later duplicated values ‘aaaa’ and ‘bbbb’.

If the spool’s properties worked correctly, they would report four rebinds and two rewinds for this example.

Worktable contents

While the plan analysis strongly suggests that empty results were cached and reused by the Lazy Index Spool, we need to look at the rows stored in its worktable for confirmation.

The trace flag output is:

Insert row (dbid=2 rowsetid=422213760188416) by xact (xactid=1749795, xts=0)

Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP    Record Size = 19

Memory Dump @0x000000BC5EBF6E70

0000000000000000:   10001000 61616161 00000000 00000000 040004    ....aaaa...........
Insert row (dbid=2 rowsetid=422213760188416) by xact (xactid=1749795, xts=0)

Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP    Record Size = 19

Memory Dump @0x000000BC5EBF6E70

0000000000000000:   10001000 62626262 704a0740 00000000 040004    ....bbbbpJ.@.......
Insert row (dbid=2 rowsetid=422213760188416) by xact (xactid=1749795, xts=0)

Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP    Record Size = 19

Memory Dump @0x000000BC5EBF6E70

0000000000000000:   10001000 63636363 704a0740 00000000 040004    ....ccccpJ.@.......
Insert row (dbid=2 rowsetid=422213760188416) by xact (xactid=1749795, xts=0)

Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP    Record Size = 19

Memory Dump @0x000000BC5EBF6E70

0000000000000000:   10001000 64646464 64646464 01000000 040000    ....dddddddd.......

As a side note, the reported rowsetid is useful if you want to use DBCC PAGE to view the worktable’s pages after execution. They are no longer allocated to the spool of course, but they are very likely to stick around in the buffer pool, and so be visible through sys.dm_os_buffer_descriptors where the database_id will be 2 (for tempdb) and the allocation_unit_id is the rowsetid.

More importantly, we can directly see the row image for rows inserted to the spool. The format isn’t as friendly as DBCC PAGE style 3, but that isn’t available with TF 600—and simply not possible when viewing pages after the event because the metadata describing the row layout was deleted with the spool.

Row Image Decoding

That leaves us with decoding the row image by hand. The FixedVar format is well known and described, for example in SQL Server Storage Internals 101 by Mark S Rasmussen.

FixedVar format

Let’s look at the first row image in detail:

10001000 61616161 00000000 00000000 040004

The first two bytes are Status Bits A & B.
The next two bytes are 1000, the length of the fixed-length portion. Remembering that Windows uses little endian memory format, with the least significant byte first, the value is 0010 = 16 bytes.
The next four bytes are 61616161 = ‘aaaa’. This is the value of the outer reference parameter value used as part of the spool index’s key.
The next four bytes are all zero. This is the cached execution subtree result for the key value.
The next four bytes are also all zero. I will describe these shortly.
The next two bytes are 0400, which in little endian format means 0004. There are four columns in the fixed-length portion of the row.
The final byte is the NULL bitmap. The value 04 = 0100 in binary, meaning the third column is NULL. The other columns are all not NULL.

Number of columns

There are a few questions here. The most obvious one is that the metadata indicates the row has four fixed-length columns, but I only described three above.

The answer to this first puzzle is that the general spool worktable is structured as a clustered index with a zero-length key column. Rows are distinguished as necessary by a uniquifier, as is normal for a non-unique clustered index.

Stored values

The original spools (rowcount, table, and index) share a common base, which uses this zero-length key with an optional uniquifier. In the Lazy Index Spool case, the index key(s) to the spool are added after the invisible zero-length one.

The columns in the worktable are therefore:

Zero length clustering key prefix (invisible because it is zero length).
Spool key column (c1 outer reference value) e.g. ‘aaaa’.
Matching cached value from the spool’s subtree.
The uniquifier.

The NULL bitmap in the row image indicates that only column 3 is NULL. This is the single cached value in the spool for the corresponding outer reference value key. The value stored in the fixed-length area is not important when the bitmap inidcates the value is NULL. In this case, 00000000 happens to be stored.

The NULL third column could mean that the spool’s subtree returned a NULL for the associated key, or that the subtree returned no row at all. This is exactly the case we are looking to distinguish.

The uniquifier for this row is 00000000.

Note in passing that this is an interesting example of the uniquifier being written to the fixed-length portion of the row. The uniquifier is normally a variable-length column in user tables because it isn’t needed for non-duplicate clustering key values.

Other rows

The second and third row images are very similar, differing mainly in the spool key storing ‘bbbb’ (62626262) or ‘cccc’ (63636363):

10001000 62626262 704a0740 00000000 040004 -- 'bbbb'
10001000 63636363 704a0740 00000000 040004 -- 'cccc'

The value stored in the NULL third column (704a0740) is unimportant. On row insertions after the first, the register holding the value that ends up being written for the third column happens to be non-zero.

Remember, the invisible zero-length key is the first column, the spool key comes next, followed by the cached value. The uniquifier is the last of the four columns.

The final row in the spool’s worktable (‘dddd’) is the only one that returns a result:

10001000 64646464 64646464 01000000 040000 -- 'dddddd'

The third column now contains 64646464, indicating that ‘dddd’ is the cached result for key ‘dddd’. The NULL bitmap is now all zeroes—no NULL columns are present in this row because there is a non-NULL cached result.

The important observation is that the uniquifier now contains 00000001. This indicates the spool has a valid cached row value for this row’s key. For all other rows, where the result was empty, the uniquifier was 00000000—an invalid value, since uniquifiers start at one and are only ever positive.

The other rows in the worktable indicate that the outer reference value has been seen, but the result is empty (not NULL) because the uniquifier has a zero value. The spool therefore does not need to execute the subtree again. It can rewind to return ‘no rows’ for the repeated values ‘aaaa’ and ‘bbbb’.

The uniquifier is how the spool differntiates a NULL cached result from an empty one.

Cached NULLs

For completeness, let’s run the test script again without the GROUP BY () clause. This means the MAX is now a scalar aggrgate, which will return NULL on an empty input rather than no rows at all.

CREATE TABLE #S 
(
    c1 char(4) NOT NULL
);
GO
INSERT #S (c1)
VALUES
    -- Some duplicates
    ('aaaa'), 
    ('bbbb'), 
    ('cccc'), 
    ('dddd'), 
    ('aaaa'),
    ('bbbb');
GO
DBCC TRACEON (600, 3604) 
    WITH NO_INFOMSGS;

SELECT
    T1.c1, 
    CA.x
FROM #S AS T1
CROSS APPLY 
(
    SELECT
        x = MAX(U.x)
    FROM 
    (
        SELECT T1.c1
        UNION
        SELECT T1.c1
    ) AS U (x)
    WHERE
        U.x > 'd'
    --GROUP BY 
    --    ()
) AS CA;

DBCC TRACEOFF (600, 3604)
    WITH NO_INFOMSGS;

DROP TABLE #S;

The results are:

Results with a scalar aggregate Scalar aggregate results

The execution plan is the same in all respects, including reported rewinds and rebinds. Showplan output does not currently distinguish scalar from vector aggregates.

There are still four rows in the spool, representing the four distinct correlated parameter values encountered. The row images are:

10001000 61616161 00000000 01000000 040004 -- aaaa
10001000 62626262 708a533f 01000000 040004 -- bbbb
10001000 63636363 708a533f 01000000 040004 -- cccc
10001000 64646464 64646464 01000000 040000 -- dddd

The NULL bitmap indicates the first three rows have a NULL cached value. The 00000001 uniquifier indicates this is a valid literal NULL, not an empty set.

The fourth row has a non-NULL cached value (‘dddd’) with the 00000001 uniquifier again indicating a valid cached value is present.

The spool was able to reuse (rewind) the stored NULL values for the repeated interations with c1 outer reference values of ‘aaaa’ and ‘bbbb’.

Final Thoughts

This is an interesting demo for several reasons. Being able to see the inserted row image using the undocumented trace flag is generally useful and frequently easier than other methods, even with the manual decoding.

Storing the uniquifier in the fixed-length portion of the row is a great example of SQL Server breaking with observed norms for internal purposes. I don’t believe the uniquifier is documented anywhere to be variable length in user tables, but that has been widely assumed over the years by experimenters.

Of course, it could be more efficient to store a ‘row valid’ flag as a one byte fixed length value, but the underlying spool architecture already requires a four byte uniquifier for general purposes, so it makes sense to reuse it this special way in the Lazy Index Spool subclass.

Using the uniquifier to distinguish a cached NULL from a cached empty set is a creative solution to the problem identified by the reader’s question.

For another example of how spools use uniquifiers, see the short video in A Follow Up On HT Waits, Row Mode, Batch Mode, and SQL Server Error 666 by Erik Darling.

Thanks for reading.