Tuesday, 15 April 2014

Cardinality Estimation for Disjunctive (OR) Predicates

Introduction

Back in January 2014, I wrote an article called Cardinality Estimation for Multiple Predicates that described the cardinality estimation process for queries with multiple predicates, from the point of view of the old and new cardinality estimators.

The article describes the various behaviours and formulas involved, along with the usual sprinkling of documented and undocumented trace flags. I described the formula SQL Server 2014 uses to calculate a cardinality estimate for multiple predicates connected with AND (conjunctive predicates), which was already relatively well-known.

Despite some fairly energetic research, and basic-to-intermediate skills with Excel, I was unable to deduce a similar formula for the disjunctive case, where predicates are connected by OR. The trace flag output I describe in the other article clearly showed that exponential backoff was used in the new 2014 cardinality estimator, but the precise formula eluded me.

I gave up on it until a few weeks ago, when I was invited to contribute to the technical review of a White Paper Joe Sack was writing. One of the contributions I was able to make was to point out that exponential backoff was used for disjunctions as well as conjunctions.

The paper has now been released so I can now write about the detail behind the statement that appears there concerning cardinality estimation for disjunctive predicates (thank you Joe for pointing this out to me):

Exponential back-off description

A Bit Of Boolean Logic

The key bit of new information is the second sentence. As soon as I read it, a vague memory from my early education came back to me, as well as from a cunning query rewrite Itzik Ben-Gan shows in one of his books. Both relate to a set of rules for transforming Boolean algebra, known as DeMorgan’s Laws. Wikipedia has this to say:

English expression of De Morgan's laws

This gives us a general way to turn a disjunction into a conjunction.

It is a short step from there to understand that exponential backoff for disjunctions in the SQL Server 2014 cardinality estimator works by applying DeMorgan’s laws.

Converting the troublesome OR to an AND (with the appropriate negations) allows us to reuse the AND exponential backoff calculation for OR predicates.

The Theory

Borrowing from the Wikipedia article again, we have DeMorgan’s Laws stated as:

De Morgan's laws

The interesting transformation from disjunction to conjunction can be expressed as:

not (A or B) is the same as (not A) and (not B)

This does not quite give us a direct translation from OR to AND. We need the extra step of observing:

(A or B) is the same as not (not (A or B))

This trivial double-negative means we can now directly substitute the not (A or B) above, to get:

(A or B) is the same as not ((not A) and (not B))

This gives us OR expressed in terms of NOT and AND.

The last bit of theory we need is that A and B are probabilities expressed as a fraction between 0 and 1, so:

(not X) = (1 – X)

Worked Example

Now we have all we need to show how the disjunctive exponential backoff calculation works in SQL Server 2014 using the AdventureWorks sample database:

SELECT
    NumRows = COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE
    TH.TransactionID BETWEEN 100000 AND 168412
    OR TH.TransactionDate BETWEEN '20070901' AND '20080313';

There are two predicates in this query, connected by OR.

The selectivity of the predicate on Transaction ID can be derived directly from the statistics histogram:

DBCC SHOW_STATISTICS
(
    'Production.TransactionHistory', 
    'PK_TransactionHistory_TransactionID'
)
WITH HISTOGRAM;

The output is:

Histogram

This very compact histogram gives us all the information we need to say that the range of 68,413 Transaction ID values can be expected to produce 68,413 rows.

Expressed as a fraction of the 113,443 rows in the table, the selectivity of this predicate is 68413 / 113443 = 0.603061. We will call this value selectivity A.

The second predicate (on TransactionDate) is estimated from the histogram in a similar way (except the histogram has more steps than is convenient to show here).

The histogram calculation results in a cardinality estimate for this predicate that is also 68,413 (because I deliberately chose the predicates to qualify exactly the same physical rows).

The selectivity is likewise 0.603061. We will call this value selectivity B.

Enter DeMorgan

We know from our earlier work with DeMorgan that:

(A or B) is the same as not ((not A) and (not B))

So we can rewrite our (A or B) predicate immediately as not ((not A) and (not B)). This means we can stop worrying about the OR.

Note that this transformation is done purely for cardinality estimation purposes — the OR is not removed from the optimizer tree.

Doing the Math

Now, given that selectivities A and B are both equal to 0.603061 in this contrived example, our expanded expression becomes:

not ((not 0.603061) and (not 0.603061))

We also know:

(not X) = (1 – X)

Expanding the inner nots means the computation becomes:

not ((1 - 0.603061) and (1 - 0.603061))

Now we can perform the numeric math inside the parentheses to get:

not (0.396939 and 0.396939)

Now we use exponential backoff to compute the AND selectivity:

not (0.396939 * SQRT(0.396939))

And replace the outer not just as we did before:

1 - (0.396939 * SQRT(0.396939))

Now we have just numbers. The final selectivity is 0.749916.

The Result

The final step is to multiply this selectivity by the cardinality of the base table to get our overall estimate:

0.749916 * 113443 = 85073 (rounded up)

The 2014 execution plan for this query using the new cardinality estimator is:

Execution plan

The cardinality estimate is 85,073 rows, just as we predicted.

The actual number of rows returned by this query is 68,413 rows (because both predicates qualify exactly the same rows). Nothing is perfect, and cardinality estimation doubly so.

Thanks for reading.