How a Lazy Index Spool Works

Introduction
I had a question from a reader of my article, Nested Loops Joins and Performance Spools, about the section describing the Apply Lazy Index Spool. The question was about how the spool behaves when a cached result returns no rows.
Quoting from the linked article:
The Lazy Index Spool operator is only available with apply.
The index spool maintains a worktable that is not truncated when outer reference values change. Instead, new data is added to the existing cache, indexed by the outer reference values. A lazy index spool differs from a lazy table spool in that it can replay results from any prior loop iteration, not just the most recent one.
This mechanism is easy to understand when a prior iteration cached a result that the spool can replay (a rewind). Presumably (the questioner says), when the cached result is NULL
, that is cached as well.
The question is: What happens when the prior iteration returned no rows at all? Does the spool somehow cache the empty result (being distinct from a NULL
), or does it execute the spoolâs input subtree again (effectively, a rebind)?
This is an insightful question and deserves a proper response.
Conditions
I gave the necessary conditions to produce an Apply Lazy Index Spool in the prior article:
The optimizer requires:
- Some duplicate join values on the outer input.
- An equality join predicate (or a logical equivalent the optimizer understands, such as
x <= y AND x >= y
).- A guarantee that the outer references are unique below the proposed lazy index spool.
An interesting restatement of these requirements is that the spool can cache a maximum of one result row for each value of the correlated apply parameter(s). In other words, the correlated apply parameters form a key for the spoolâs cache.
Demo
To investigate how the spool functions internally, we ideally need to see the contents of the spoolâs worktable.
There are a couple of ways we could achieve this, but the most convenient is to use an undocumented trace flag. Apologies to anyone who canât set trace flags, but thatâs what you get for invoking the awesome power of the cloud.
Iâm going to use one of my favourite undocumented trace flagsâone I donât think I have revealed before. It is trace flag 600, which prints a row image for inserts to a b-tree structure (not heaps). The spoolâs worktable is organized as a clustered index, so this will work perfectly here.
The demo is a slight variation on the one used in the prior article:
CREATE TABLE #S
(
c1 char(4) NOT NULL
);
GO
INSERT #S (c1)
VALUES
-- Some duplicates
('aaaa'),
('bbbb'),
('cccc'),
('dddd'),
('aaaa'),
('bbbb');
GO
DBCC TRACEON (600, 3604)
WITH NO_INFOMSGS;
SELECT
T1.c1,
CA.x
FROM #S AS T1
CROSS APPLY
(
SELECT
x = MAX(U.x)
FROM
(
SELECT T1.c1
UNION
SELECT T1.c1
) AS U (x)
WHERE
U.x > 'd'
GROUP BY
()
) AS CA;
DBCC TRACEOFF (600, 3604)
WITH NO_INFOMSGS;
DROP TABLE #S;
The query itself is just complex enough to prevent the optimizer from transforming it to a simpler form that does not require a spool, while meeting all the other requirements.
One key aspect is the GROUP BY ()
clause. This ensures the MAX
aggregate behaves as a vector aggregate, rather than as a scalar aggregate. If you need a refresher on that, see my article Fun with Scalar and Vector Aggregates.
In SQL Server, a scalar aggregate always returns a non-empty result for an empty input (NULL
for the MAX
case); a vector aggregate returns no rows at all, which is what we want here.
Overall, the effect is that the APPLY
above returns one row containing âddddâ when the correlated parameter value is âddddâ. It returns an empty result set for all other values.
The results are:
The execution plan is:
Plan analysis
Unfortunately, the Actual Rebinds and Actual Rewinds properties of a Lazy Index Spool are not reliable due to implementation details.
The spool in the plan above reports six rebinds and zero rewinds. This reflects the fact that the correlated parameter value changed six times and matched the immediately prior value zero times:
This outcome corresponds to the Table Scan producing rows in the order they were inserted:
- âaaaaâ â first value = rebind.
- âbbbbâ â not the same = rebind.
- âccccâ â not the same = rebind.
- âddddâ â not the same = rebind.
- âaaaaâ â matches a previous value, but not the prior one, reported as a rebind.
- âbbbbâ â matches a previous value, but not the prior one, reported as a rebind.
These rebind and rewind values would be correct for a Table Spool, which only caches the prior result and truncates its worktable on a rebind. The numbers are not correct for a Lazy Index Spool.
Actual behaviour
We can see the Lazy Index Spool behaved differently by looking at the Number of Executions reported by operators in its input tree, for example the Stream Aggregate.
This reports four separate executions, which makes sense because there are four unique values in the driving set:
We can infer from this that the Lazy Index Spool did indeed cache results for âaaaaâ, âbbbbâ, âccccâ, and âddddâ and reuse those empty results for the later duplicated values âaaaaâ and âbbbbâ.
If the spoolâs properties worked correctly, they would report four rebinds and two rewinds for this example.
Worktable contents
While the plan analysis strongly suggests that empty results were cached and reused by the Lazy Index Spool, we need to look at the rows stored in its worktable for confirmation.
The trace flag output is:
Insert row (dbid=2 rowsetid=422213760188416) by xact (xactid=1749795, xts=0)
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP Record Size = 19
Memory Dump @0x000000BC5EBF6E70
0000000000000000: 10001000 61616161 00000000 00000000 040004 ....aaaa...........
Insert row (dbid=2 rowsetid=422213760188416) by xact (xactid=1749795, xts=0)
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP Record Size = 19
Memory Dump @0x000000BC5EBF6E70
0000000000000000: 10001000 62626262 704a0740 00000000 040004 ....bbbbpJ.@.......
Insert row (dbid=2 rowsetid=422213760188416) by xact (xactid=1749795, xts=0)
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP Record Size = 19
Memory Dump @0x000000BC5EBF6E70
0000000000000000: 10001000 63636363 704a0740 00000000 040004 ....ccccpJ.@.......
Insert row (dbid=2 rowsetid=422213760188416) by xact (xactid=1749795, xts=0)
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP Record Size = 19
Memory Dump @0x000000BC5EBF6E70
0000000000000000: 10001000 64646464 64646464 01000000 040000 ....dddddddd.......
As a side note, the reported rowsetid
is useful if you want to use DBCC PAGE
to view the worktableâs pages after execution. They are no longer allocated to the spool of course, but they are very likely to stick around in the buffer pool, and so be visible through sys.dm_os_buffer_descriptors
where the database_id
will be 2 (for tempdb
) and the allocation_unit_id
is the rowsetid
.
More importantly, we can directly see the row image for rows inserted to the spool. The format isnât as friendly as DBCC PAGE
style 3, but that isnât available with TF 600âand simply not possible when viewing pages after the event because the metadata describing the row layout was deleted with the spool.
Row Image Decoding
That leaves us with decoding the row image by hand. The FixedVar format is well known and described, for example in SQL Server Storage Internals 101 by Mark S Rasmussen.
Letâs look at the first row image in detail:
10001000 61616161 00000000 00000000 040004
- The first two bytes are Status Bits A & B.
- The next two bytes are
1000
, the length of the fixed-length portion. Remembering that Windows uses little endian memory format, with the least significant byte first, the value is0010
= 16 bytes. - The next four bytes are
61616161
= âaaaaâ. This is the value of the outer reference parameter value used as part of the spool indexâs key. - The next four bytes are all zero. This is the cached execution subtree result for the key value.
- The next four bytes are also all zero. I will describe these shortly.
- The next two bytes are
0400
, which in little endian format means0004
. There are four columns in the fixed-length portion of the row. - The final byte is the
NULL
bitmap. The value04
=0100
in binary, meaning the third column isNULL
. The other columns are all notNULL
.
Number of columns
There are a few questions here. The most obvious one is that the metadata indicates the row has four fixed-length columns, but I only described three above.
The answer to this first puzzle is that the general spool worktable is structured as a clustered index with a zero-length key column. Rows are distinguished as necessary by a uniquifier, as is normal for a non-unique clustered index.
Stored values
The original spools (rowcount, table, and index) share a common base, which uses this zero-length key with an optional uniquifier. In the Lazy Index Spool case, the index key(s) to the spool are added after the invisible zero-length one.
The columns in the worktable are therefore:
- Zero length clustering key prefix (invisible because it is zero length).
- Spool key column (
c1
outer reference value) e.g. âaaaaâ. - Matching cached value from the spoolâs subtree.
- The uniquifier.
The NULL
bitmap in the row image indicates that only column 3 is NULL
. This is the single cached value in the spool for the corresponding outer reference value key. The value stored in the fixed-length area is not important when the bitmap inidcates the value is NULL
. In this case, 00000000
happens to be stored.
The NULL
third column could mean that the spoolâs subtree returned a NULL
for the associated key, or that the subtree returned no row at all. This is exactly the case we are looking to distinguish.
The uniquifier for this row is 00000000
.
Note in passing that this is an interesting example of the uniquifier being written to the fixed-length portion of the row. The uniquifier is normally a variable-length column in user tables because it isnât needed for non-duplicate clustering key values.
Other rows
The second and third row images are very similar, differing mainly in the spool key storing âbbbbâ (62626262
) or âccccâ (63636363
):
10001000 62626262 704a0740 00000000 040004 -- 'bbbb'
10001000 63636363 704a0740 00000000 040004 -- 'cccc'
The value stored in the NULL
third column (704a0740
) is unimportant. On row insertions after the first, the register holding the value that ends up being written for the third column happens to be non-zero.
Remember, the invisible zero-length key is the first column, the spool key comes next, followed by the cached value. The uniquifier is the last of the four columns.
The final row in the spoolâs worktable (âddddâ) is the only one that returns a result:
10001000 64646464 64646464 01000000 040000 -- 'dddddd'
The third column now contains 64646464
, indicating that âddddâ is the cached result for key âddddâ. The NULL
bitmap is now all zeroesâno NULL
columns are present in this row because there is a non-NULL
cached result.
The important observation is that the uniquifier now contains 00000001
. This indicates the spool has a valid cached row value for this rowâs key. For all other rows, where the result was empty, the uniquifier was 00000000
âan invalid value, since uniquifiers start at one and are only ever positive.
The other rows in the worktable indicate that the outer reference value has been seen, but the result is empty (not NULL
) because the uniquifier has a zero value. The spool therefore does not need to execute the subtree again. It can rewind to return âno rowsâ for the repeated values âaaaaâ and âbbbbâ.
The uniquifier is how the spool differntiates a NULL
cached result from an empty one.
Cached NULLs
For completeness, letâs run the test script again without the GROUP BY ()
clause. This means the MAX
is now a scalar aggrgate, which will return NULL
on an empty input rather than no rows at all.
CREATE TABLE #S
(
c1 char(4) NOT NULL
);
GO
INSERT #S (c1)
VALUES
-- Some duplicates
('aaaa'),
('bbbb'),
('cccc'),
('dddd'),
('aaaa'),
('bbbb');
GO
DBCC TRACEON (600, 3604)
WITH NO_INFOMSGS;
SELECT
T1.c1,
CA.x
FROM #S AS T1
CROSS APPLY
(
SELECT
x = MAX(U.x)
FROM
(
SELECT T1.c1
UNION
SELECT T1.c1
) AS U (x)
WHERE
U.x > 'd'
--GROUP BY
-- ()
) AS CA;
DBCC TRACEOFF (600, 3604)
WITH NO_INFOMSGS;
DROP TABLE #S;
The results are:
The execution plan is the same in all respects, including reported rewinds and rebinds. Showplan output does not currently distinguish scalar from vector aggregates.
There are still four rows in the spool, representing the four distinct correlated parameter values encountered. The row images are:
10001000 61616161 00000000 01000000 040004 -- aaaa
10001000 62626262 708a533f 01000000 040004 -- bbbb
10001000 63636363 708a533f 01000000 040004 -- cccc
10001000 64646464 64646464 01000000 040000 -- dddd
The NULL
bitmap indicates the first three rows have a NULL
cached value. The 00000001
uniquifier indicates this is a valid literal NULL
, not an empty set.
The fourth row has a non-NULL
cached value (âddddâ) with the 00000001
uniquifier again indicating a valid cached value is present.
The spool was able to reuse (rewind) the stored NULL
values for the repeated interations with c1
outer reference values of âaaaaâ and âbbbbâ.
Final Thoughts
This is an interesting demo for several reasons. Being able to see the inserted row image using the undocumented trace flag is generally useful and frequently easier than other methods, even with the manual decoding.
Storing the uniquifier in the fixed-length portion of the row is a great example of SQL Server breaking with observed norms for internal purposes. I donât believe the uniquifier is documented anywhere to be variable length in user tables, but that has been widely assumed over the years by experimenters.
Of course, it could be more efficient to store a ârow validâ flag as a one byte fixed length value, but the underlying spool architecture already requires a four byte uniquifier for general purposes, so it makes sense to reuse it this special way in the Lazy Index Spool subclass.
Using the uniquifier to distinguish a cached NULL
from a cached empty set is a creative solution to the problem identified by the readerâs question.
For another example of how spools use uniquifiers, see the short video in A Follow Up On HT Waits, Row Mode, Batch Mode, and SQL Server Error 666 by Erik Darling.
Thanks for reading.