Monday, 02 December 2024

Improved RCSI Ghost Cleanup in SQL Server 2022

Background

The read committed snapshot isolation (RCSI) isolation level provides many benefits. Chief among them:

Readers won’t block writers (and vice versa).
Each statement sees a point-in-time view of the committed database (with exceptions for submodule calls like non-inlined functions).

On the other side of the coin, we have the costs of maintaining the row versions needed by the RCSI implementation.

This is not just a case of making sure tempdb (or the user database if ADR is in use) is large enough and can handle the extra concurrent activity:

Writers changing existing data must create row versions (some inserts too).
The system must check and maintain the version store in the background.
Readers have to locate the correct version of each row.

That last item can cause a significant performance reduction when long version chains are involved. This is somewhat less likely to occur under RCSI than SI because each RCSI statement sees a more recent point in time. You might still encounter issues with long-running RCSI statements like reports, but that’s not the focus of this article.

There is a separate issue with RCSI that I want to cover today.

Demo

It is not unusual to find a frequently updated table in a database using RCSI. This toy example boils the scenario down to one row in a table with a single indexed column:

DROP TABLE IF EXISTS dbo.Test;
GO
CREATE TABLE dbo.Test (id integer PRIMARY KEY CLUSTERED);
INSERT dbo.Test (id) VALUES (0);
GO
CHECKPOINT;

To simulate the effect of a reasonable number of separate modification statements, we’ll change the value stored in the row 123,456 times:

SET NOCOUNT, XACT_ABORT ON;
SET STATISTICS XML OFF;

DECLARE 
    @i integer = 1;

WHILE @i < 123456
BEGIN
    BEGIN TRANSACTION;

    UPDATE TOP (1) 
        dbo.Test WITH (TABLOCKX)
    SET @i = id = @i + 1;

    COMMIT TRANSACTION 
        WITH (DELAYED_DURABILITY = ON);
END;

Note: The TOP (1) and TABLOCKX hints make the updates faster by lowering the locking overhead and avoiding unnecessary Halloween Protection in the execution plan.

Updates under RCSI (without Optimized Locking) acquire update locks when locating qualifying rows to change. The reason the table lock is an optimization will be explained shortly.

We’re only ever updating one row (over and over again), so the Halloween Protection provided by the Top operator is sufficient to avoid an Eager Table Spool.

Delayed durability is used because we’re not looking to test logging performance here.

Feel free to run the example without those optimizations if you prefer.

The (optimized) estimated execution plan is:

Estimated update execution plan

The redundant Top operator is just Halloween Protection doing its thing.

After each run, we’ll look at the physical index statistics with this DMV query:

SELECT
    [level] = IPS.index_level,
    [pages] = IPS.page_count,
    [used %] = FORMAT(IPS.avg_page_space_used_in_percent, 'N4'),
    [avg size] = IPS.avg_record_size_in_bytes,
    [rows] = IPS.record_count,
    [ghosts] = IPS.ghost_record_count,
    [versions] = IPS.version_record_count,
    [ghost versions] = IPS.version_ghost_record_count
FROM sys.dm_db_index_physical_stats
(
    DB_ID(), 
    OBJECT_ID(N'dbo.Test', 'U'), 
    1,      -- index id
    NULL,   -- partition 
    'DETAILED'
) AS IPS
ORDER BY
    IPS.index_level;

1. SQL Server 2019 Locking Read Committed

This first test runs on SQL Server 2019 CU29-GDR (build 15.0.4410)—the latest available as of December 2, 2004.

The test database is set to use the simple recovery model and allow delayed durability. It has ADR, SI, and RCSI all disabled:

ALTER DATABASE CURRENT
    SET RECOVERY SIMPLE;

ALTER DATABASE CURRENT 
    SET READ_COMMITTED_SNAPSHOT OFF
    WITH ROLLBACK IMMEDIATE;

ALTER DATABASE CURRENT
    SET ALLOW_SNAPSHOT_ISOLATION OFF;

ALTER DATABASE CURRENT
    SET DELAYED_DURABILITY = ALLOWED;

ALTER DATABASE CURRENT
    SET ACCELERATED_DATABASE_RECOVERY = OFF;

This sets a performance baseline using locking read committed. In this configuration, the test runs for roughly 2 seconds.

The DMV query shows a single page in the table containing one row and no ghosts, as expected:

level	pages	used %	avg size	rows	ghosts	versions	ghost versions
0	1	0.1359	11	1	0	0	0

As a side note, without delayed transaction durability the test runs for 12 seconds.

Ghost story

That might seem like a long time, but we’re changing the key column of the index so the update is performed as a delete followed by an insert. The delete marks the row as a ghost and the insert adds a new row.

The next iteration of the loop will encounter the ghost row during its index scan and add it to the background ghost cleanup task’s queue. Depending on how quickly ghost cleanup proceeds, subsequent iterations might encounter many ghost rows before locating the one non-deleted row.

In this particular test, using a laptop with no other activity, the ghost cleanup task does very well—there are no ghosts in the table at the end of the test. This is not always the case even on modern SQL Server versions with their long history of ghost cleanup fixes, as Forrest McDaniel discovered in Fixing Queues with Watermarks.

2. SQL Server 2019 RCSI

Let’s now switch from locking read committed to RCSI without changing anything else.

ALTER DATABASE CURRENT 
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;

Running exactly the same test again results in an execution time of 1m 22s. That’s significantly longer than the 2s using the locking read committed isolation level.

The DMV query shows:

level	pages	used %	avg size	rows	ghosts	versions	ghost versions
0	86	49.8145	25	1	12301	12848	547
1	1	13.7880	11	86	0	0	0

Explanation

The big difference this time is that the scan can’t queue ghosts the update encounters for cleanup because their transaction sequence number isn’t older than the current oldest active transaction. This is just the normal way SQL Server decides if it safe to dispose of versions and ghosts or not under row versioning isolation levels.

Note that this would still be the case with an autocommit transaction instead of the explicit one used in the test. An autocommit transaction would also require us to use FORCED delayed durability because there’s no COMMIT statement to attach the durability option to. If you try that configuration, you’ll find performance is the same. You can save a bit of typing by using implicit transactions, but again, performance will be the same.

Once each update transaction completes, normal ghost and version store cleanup can start removing the deleted rows from the table. That’s an asynchronous process on a separate system thread, so it will be running at the same time as the next update.

The situation at the end of the test shows an intermidate state after some ghosts have been exorcised but not all of them.

After cleaning

A few seconds later, the DMV query returns:

level	pages	used %	avg size	rows	ghosts	versions	ghost versions
0	2	0.3089	25	1	1	1	0
1	1	0.2965	11	2	0	0	0

Notice that the index acquired a second level, meaning the original page split during the test. This happened because the page was full of active version ghosts. The insert part of the update statement had to split the page and add a new index root. This happened many times as we can infer from the previous DMV output showing 86 pages at the leaf level, all roughly 50% full as is typical after a split.

The leaf-level row size has increased from 11 bytes to 25 due to the 14-byte row versioning overhead. Non-leaf levels do not use versioning or ghosting.

The one remaining ghost is on the original index page. SQL Server leaves a single ghost record when it is the root page of the index according to system metadata. The one non-deleted row is on a different page. All the other leaf level index pages have been cleaned up and disposed of since they were empty and not the first metadata page.

Scarier ghost story

The terrible performance under RCSI is explained by the increased number of ghost records each update scan encounters and has to skip over to get to the one visible row.

Not only does SQL Server have to find the row to discover that it is marked as a ghost, it also has to request an update lock on it beforehand. The lock is ultimately not acquired because it is covered by the existing TABLOCKX, but the lock manager still has to process the request and discover it is redundant due to the already held lock. Without the locking hint, real locks are taken and performance is much worse.

None of the foregoing takes very long individually, but monitoring shows SQL Server was handling around 6 million lock requests per second during the test (and skipping the same number of ghosts).

Remember, update locks are required when locating qualifying rows even under RCSI when Optimized Locking isn’t available or enabled. That is an Azure SQL Database only feature at the time of writing. I expect we’ll see it in SQL Server 2025.

Ultimately, the problem is caused by having to request a lock on—and then skip—hundreds of millions of ghosts during the test. Remember, we’re dealing with a one column table with a single row.

Ghost and version cleanup under RCSI can’t start until the end of each transaction and the next update starts almost immediately. That update will encounter version ghosts from the prior run(s) as well as those it creates itself.

But wait, it gets worse.

3. SQL Server 2019 RCSI with an open transaction

This third test is a repeat of the second test with an open transaction. Drop and recreate the test table as usual, but then start a transaction in a separate session connected to the same database:

-- Will be RCSI
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

-- Leave this transaction open
BEGIN TRANSACTION;
    SELECT COUNT_BIG(*) FROM dbo.AccessMe AS AM;

It is important that the transaction runs under a row versioning isolation level and has truly started (by accessing a user table in the target database). You can use any table in the above code block, even the dbo.Test table itself.

The important thing is that the transaction appears in the DMV sys.dm_tran_active_snapshot_database_transactions. Notice that the transaction does not need to be using snapshot isolation—RCSI will do just fine.

Terrifying ghost story

Running the update test now takes over four minutes. Without the delayed durability optimization, that becomes half an hour. Recall that the locking read committed test required just 12 seconds in those conditions. RCSI without the open transaction took 1m 22s. We’re only updating one row 123,456 times!

The DMV query returns:

level	pages	used %	avg size	rows	versions	ghost versions
0	823	50.0148	25	1	123455	123455
1	2	66.0675	11	823	0	0
2	1	0.2965	11	2	0	0

The index for our one-row table now has three levels, with 823 pages at the leaf. There are 123,455 ghost versions—one for every update performed. These ghosts can’t be cleaned up until the transaction we left open commits or rolls back. Each ghost has a versioned row in the version store.

Details

Each of the 123,456 individual single row updates had to skip one less ghost row than its iteration number. For example, update number 100,000 had to skip 99,999 ghost rows.

Remember, each skipped row is preceeded by a lock request. The locking can’t be avoided because SERIALIZABLE transactions can lock ghosted rows. Not honouring a lock could result in a violation of the ACID guarantees.

Hopefully, you can see how inefficient this all is and where the time goes. Wait statistics won’t help because SQL Server is never waiting for a resource—it’s working as hard as it can requesting locks and skipping ghosts at an impressive rate.

Unless you know to look at performance counters for skipped ghost records, it seems like SQL Server is just taking up to half an hour to update one row 123,456 times, with one core maxed out.

Don’t forget to close the open transaction!

The ghost versions will be cleaned up within a few seconds after you do that.

SQL Server 2022 Improvement

On SQL Server 2022, the locking read committed test completes in 2-3 seconds, as before. The RCSI test runs for about 1m 28s, again about the same as on 2019.

The big difference is with the open transaction. That test now runs for roughly the same time as without the open transaction—1m 32s in the test I ran just now. That’s much better than over 4 minutes, but still not great.

At the end of the test, the DMV shows that some ghosts were cleaned during the run:

level	pages	used %	avg size	rows	ghosts	versions	ghost versions
0	231	49.9272	25	1	34201	34590	389
1	1	37.0768	11	231	0	0	0

A few seconds later, the cleaning is complete:

level	pages	used %	avg size	rows	ghosts	versions	ghost versions
0	2	0.3089	25	1	1	1	0
1	1	0.2965	11	2	0	0	0

Analysis

The open RCSI transaction no longer prevents ghost cleanup, which explains the improved performance. Fewer ghosts are encountered, so fewer locks are requested and fewer rows are skipped. Rather than 6 million lock requests per second, the server now settles at around 10,000 reflecting the lower average number of ghost versions encountered.

SQL Server 2022 enables statement-level ghost cleanup under RCSI.

It’s still nowhere near as fast as locking read committed, but it’s much better than on 2019. If you ever need to disable this improvement on SQL Server 2022, undocumented global trace flag 7006 serves that purpose.

Important

The improvement in SQL Server 2022 only allows ghosts to be cleaned up earlier when there is a concurrent active RCSI transaction. The version store records are unaffected and will remain until the end of the transaction as before.

This much is at least documented:

Even though READ COMMITTED transactions using row versioning provides a transactionally consistent view of the data at a statement level, row versions generated or accessed by this type of transaction are maintained until the transaction completes.

There appears to be a pending improvement to address this, but it is currently hidden behind an undocumented trace flag and isn’t fully functional even with it enabled. Something to watch out for in SQL Server 2025, perhaps.

An open snapshot isolation transaction still prevents early ghost cleanup because that transaction might legitimately need to encounter those ghosts and ghosted versions to provide a view of the committed database as it was when the transaction started. Remember, RCSI only offers a view as of the start of the current statement.

Final Thoughts

You may not rapidly update a single-row table and access it with a locking scan as literally shown in these tests. Nevertheless, ghosts are a fact of life and maintained even for heap tables when a row versioning isolation level is in use.

Scans may be frowned upon, but remember all seeks have a scanning component unless they are equality tests on an enforced key (a ‘singleton seek’ into a unique index or constraint). The range scan component of a seek may have to skip ghosts to find rows it can logically ‘see’.

Row versioning comes with costs, even without long version chains. None of the examples in this article resulted in more than a single entry in each version chain. If that sounds wrong, consider that the update is implemented as a delete followed by an insert. There is a new live row on each iteration.

The convenience of RCSI and SI do not come for free. In some cases, the effects can be dramatic and hard to diagnose unless you already know what you’re looking for.

It’s not currently possible to exclude some tables from row versioning. The setting is per database and all-or-nothing, heavily updated tables and indexed views as well as static ones.

Specifying READCOMMITTEDLOCK only affects whether shared locks are acquired, it won’t help you avoid ghosts and ghost versions.

Row versions are always generated and maintained when a row versioning isolation level is enabled—even if it is never used.

Long-running transactions are always bad news, especially if they are idle and unintentional. SQL Server 2022 can help avoid the worst of the performance impacts shown here, but the baseline RCSI run time (1.5 minutes) was still pretty poor compared with locking read committed (2 seconds). That occurred without any other open transactions.

ADR was disabled for these tests, but the core issues reproduce the same with it enabled. Having ADR makes life more difficult because the performance counters for skipped ghost records and version store size do not function, without an obvious replacement.

Thanks for reading.