Tuesday, 28 July 2015

Improving the Pre-2012 Row Numbering Median Solution

The fastest way to compute a median uses the SQL Server 2012 OFFSET extension to the ORDER BY clause. Running a close second, the next fastest solution uses a dynamic cursor that works on all versions. This article looks at a common pre-2012 ROW_NUMBER solution to the median calculation problem to see why it performs less well, and what can be done to make it go faster.

Single Median Test

The sample data for this test consists of a single ten million row table (reproduced from Aaron Bertrand’s original article):

CREATE TABLE dbo.obj
(
    id  integer NOT NULL IDENTITY(1,1), 
    val integer NOT NULL
);

INSERT dbo.obj WITH (TABLOCKX) 
    (val)
SELECT TOP (10000000) 
    AO.[object_id]
FROM sys.all_columns AS AC
CROSS JOIN sys.all_objects AS AO
CROSS JOIN sys.all_objects AS AO2
WHERE AO.[object_id] > 0
ORDER BY 
    AC.[object_id];

CREATE UNIQUE CLUSTERED INDEX cx 
ON dbo.obj(val, id);

The `OFFSET` solution

To set the benchmark, here is the SQL Server 2012 (or later) OFFSET solution created by Peter Larsson:

DECLARE @Start datetime2 = SYSUTCDATETIME();

DECLARE @Count bigint = 10000000
--(
--    SELECT COUNT_BIG(*) 
--    FROM dbo.obj AS O
--);

SELECT 
    Median = AVG(1.0 * SQ1.val)
FROM 
(
    SELECT O.val 
    FROM dbo.obj AS O
    ORDER BY O.val
    OFFSET (@Count - 1) / 2 ROWS
    FETCH NEXT 1 + (1 - (@Count % 2)) ROWS ONLY
) AS SQ1;

SELECT Peso = DATEDIFF(MILLISECOND, @Start, SYSUTCDATETIME());

The query to count the rows in the table is commented out and replaced with a hard-coded value so as to concentrate on the performance of the core code. With a warm cache and execution plan collection turned off, this query runs for 910 ms on average on my test machine. The execution plan is shown below:

OFFSET single median execution plan

As a side note, it is interesting that this moderately complex query qualifies for a trivial plan:

The `ROW_NUMBER` solution

For systems running SQL Server 2008 R2 or earlier, the best-performing of the alterative solutions uses a dynamic cursor as mentioned previously. If you are unable (or unwilling) to consider that as an option, it is natural to think about emulating the 2012 OFFSET execution plan using ROW_NUMBER.

The basic idea is to number the rows in the appropriate order, then filter for just the one or two rows needed to compute the median. There are several ways to write this in Transact SQL; a compact version that captures all the key elements is as follows:

DECLARE @Start datetime2 = SYSUTCDATETIME();

DECLARE @Count bigint = 10000000
--(
--    SELECT COUNT_BIG(*) 
--    FROM dbo.obj AS O
--);

SELECT AVG(1.0 * SQ1.val) FROM 
(
    SELECT
        O.val,
        rn = ROW_NUMBER() OVER (
            ORDER BY O.val)
    FROM dbo.obj AS O
) AS SQ1
WHERE 
    SQ1.rn BETWEEN (@Count + 1)/2 AND (@Count + 2)/2;

SELECT Pre2012 = DATEDIFF(MILLISECOND, @Start, SYSUTCDATETIME());

The resulting execution plan is quite similar to the OFFSET version:

ROW_NUMBER single median execution plan

It is worth looking at each of the plan operators in turn to understand them fully:

The Segment operator is redundant in this plan. It would be required if the ROW_NUMBER ranking function had a PARTITION BY clause, but it does not. Even so, it remains in the final plan.
The Sequence Project adds a calculated row number to the stream of rows.
The Compute Scalar defines an expression associated with the need to implicitly convert the val column to numeric so it can be multiplied by the constant literal 1.0 in the query. This computation is deferred until needed by a later operator (which happens to be the Stream Aggregate). This runtime optimization means the implicit conversion is only performed for the two rows processed by the Stream Aggregate, not the 5,000,001 rows indicated for the Compute Scalar.
The Top operator is introduced by the query optimizer. It recognises that at most, only the first (@Count + 2) / 2 rows are needed by the query. We could have added a TOP ... ORDER BY in the subquery to make this explicit, but this optimization makes that largely unnecessary.
The Filter implements the condition in the WHERE clause, filtering out all but the two ‘middle’ rows needed to compute the median (the introduced Top is also based on this condition).
The Stream Aggregate computes the SUM and COUNT of the two median rows.
The final Compute Scalar computes the average from the sum and count.

Raw Performance

Compared with the OFFSET plan, we might expect that the additional Segment, Sequence Project, and Filter operators are going to have some adverse effect on performance. It is worth taking a moment to compare the estimated costs of the two plans:

The OFFSET plan has an estimated cost of 0.0036266 units, while the ROW_NUMBER plan is estimated at 0.0036744 units. These are very small numbers, and there is little difference between the two.

So, it is perhaps surprising that the ROW_NUMBER query actually runs for 4000 ms on average, compared with 910 ms average for the OFFSET solution. Some of this increase can surely be explained by the overhead of the extra plan operators, but a factor of four seems excessive. There must be more to it.

You have probably also noticed that the cardinality estimates for both estimated plans above are pretty hopelessly wrong. This is due to the effect of the Top operators, which have an expression referencing a variable as their row count limits. The query optimizer cannot see the contents of variables at compilation time, so it resorts to its default guess of 100 rows. Both plans actually encounter 5,000,001 rows at runtime.

This is all very interesting, but it does not directly explain why the ROW_NUMBER query is more than four times slower than the OFFSET version. After all, the 100 row cardinality estimate is just as wrong in both cases.

Options

In my previous article, we saw how the performance of the grouped median OFFSET test could be almost doubled by simply adding a PAGLOCK hint. This hint overrides the storage engine’s normal decision to acquire and release shared locks at the row granularity (due to the low expected cardinality).

As a further reminder, the PAGLOCK hint was unnecessary in the single median OFFSET test due to a separate internal optimization that can skip row level shared locks, resulting in only a small number of intent-shared locks being taken at the page level.

We might expect the ROW_NUMBER single median solution to benefit from the same internal optimization, but it does not. Monitoring locking activity while the ROW_NUMBER query executes, we see over half a million individual row level shared locks being taken and released.

This is the problem with undocumented internal optimizations: we can never be sure when they will and will not be applied.

So, now we know what the problem is, we can improve the locking performance in the same way we did previously, either with a PAGLOCK lock granularity hint or by increasing the cardinality estimate using documented trace flag 4138.

Choosing a fix

Disabling the “row goal” using the trace flag is the less satisfactory solution for several reasons. First, it is only effective in SQL Server 2008 R2 or later. We would most likely prefer the OFFSET solution in SQL Server 2012, so this effectively limits the trace flag fix to SQL Server 2008 R2 only. Second, applying the trace flag requires administrator-level permissions, unless applied via a plan guide. A third reason is that disabling row goals for the whole query may have other undesirable effects, especially in more complex plans.

By contrast, the PAGLOCK hint is effective, available in all versions of SQL Server without any special permissions, and does not have any major side effects beyond locking granularity.

Impact

Applying the PAGLOCK hint to the ROW_NUMBER query increases performance dramatically, from 4000 ms to 1500 ms:

DECLARE @Start datetime2 = SYSUTCDATETIME();

DECLARE @Count bigint = 10000000
--(
--    SELECT COUNT_BIG(*) 
--    FROM dbo.obj AS O
--);

SELECT AVG(1.0 * SQ1.val) FROM 
(
    SELECT
        O.val,
        rn = ROW_NUMBER() OVER (
            ORDER BY O.val)
    FROM dbo.obj AS O WITH (PAGLOCK) -- New!
) AS SQ1
WHERE 
    SQ1.rn BETWEEN (@Count + 1)/2 AND (@Count + 2)/2;

SELECT Pre2012 = DATEDIFF(MILLISECOND, @Start, SYSUTCDATETIME());

The 1500 ms result is still significantly slower than the 910 ms for the OFFSET solution, but at least it is now in the same ballpark. The remaining performance differential is simply due to the extra work in the execution plan:

In the OFFSET plan, five million rows are processed as far as the Top (with the expressions defined at the Compute Scalar deferred as discussed earlier). In the ROW_NUMBER plan, the same number of rows have to be processed by the Segment, Sequence Project, Top, and Filter.

Thanks for reading.