Cardinality Estimation for Disjunctive (OR) Predicates
Introduction
Back in January 2014, I wrote an article called Cardinality Estimation for Multiple Predicates that described the cardinality estimation process for queries with multiple predicates, from the point of view of the old and new cardinality estimators.
The article describes the various behaviours and formulas involved, along with the usual sprinkling of documented and undocumented trace flags. I described the formula SQL Server 2014 uses to calculate a cardinality estimate for multiple predicates connected with AND
(conjunctive predicates), which was already relatively well-known.
Despite some fairly energetic research, and basic-to-intermediate skills with Excel, I was unable to deduce a similar formula for the disjunctive case, where predicates are connected by OR
. The trace flag output I describe in the other article clearly showed that exponential backoff was used in the new 2014 cardinality estimator, but the precise formula eluded me.
I gave up on it until a few weeks ago, when I was invited to contribute to the technical review of a White Paper Joe Sack was writing. One of the contributions I was able to make was to point out that exponential backoff was used for disjunctions as well as conjunctions.
The paper has now been released so I can now write about the detail behind the statement that appears there concerning cardinality estimation for disjunctive predicates (thank you Joe for pointing this out to me):
A Bit Of Boolean Logic
The key bit of new information is the second sentence. As soon as I read it, a vague memory from my early education came back to me, as well as from a cunning query rewrite Itzik Ben-Gan shows in one of his books. Both relate to a set of rules for transforming Boolean algebra, known as DeMorgan’s Laws. Wikipedia has this to say:
This gives us a general way to turn a disjunction into a conjunction.
It is a short step from there to understand that exponential backoff for disjunctions in the SQL Server 2014 cardinality estimator works by applying DeMorgan’s laws.
Converting the troublesome OR
to an AND
(with the appropriate negations) allows us to reuse the AND
exponential backoff calculation for OR
predicates.
The Theory
Borrowing from the Wikipedia article again, we have DeMorgan’s Laws stated as:
The interesting transformation from disjunction to conjunction can be expressed as:
not (A or B)
is the same as(not A) and (not B)
This does not quite give us a direct translation from OR
to AND
. We need the extra step of observing:
(A or B)
is the same asnot (not (A or B))
This trivial double-negative means we can now directly substitute the not (A or B)
above, to get:
(A or B)
is the same asnot ((not A) and (not B))
This gives us OR
expressed in terms of NOT
and AND
.
The last bit of theory we need is that A and B are probabilities expressed as a fraction between 0 and 1, so:
(not X) = (1 – X)
Worked Example
Now we have all we need to show how the disjunctive exponential backoff calculation works in SQL Server 2014 using the AdventureWorks sample database:
SELECT
NumRows = COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE
TH.TransactionID BETWEEN 100000 AND 168412
OR TH.TransactionDate BETWEEN '20070901' AND '20080313';
There are two predicates in this query, connected by OR.
The selectivity of the predicate on Transaction ID
can be derived directly from the statistics histogram:
DBCC SHOW_STATISTICS
(
'Production.TransactionHistory',
'PK_TransactionHistory_TransactionID'
)
WITH HISTOGRAM;
The output is:
This very compact histogram gives us all the information we need to say that the range of 68,413 Transaction ID
values can be expected to produce 68,413 rows.
Expressed as a fraction of the 113,443 rows in the table, the selectivity of this predicate is 68413 / 113443 = 0.603061. We will call this value selectivity A
.
The second predicate (on TransactionDate
) is estimated from the histogram in a similar way (except the histogram has more steps than is convenient to show here).
The histogram calculation results in a cardinality estimate for this predicate that is also 68,413 (because I deliberately chose the predicates to qualify exactly the same physical rows).
The selectivity is likewise 0.603061. We will call this value selectivity B
.
Enter DeMorgan
We know from our earlier work with DeMorgan that:
(A or B)
is the same asnot ((not A) and (not B))
So we can rewrite our (A or B)
predicate immediately as not ((not A) and (not B))
. This means we can stop worrying about the OR
.
Note that this transformation is done purely for cardinality estimation purposes — the OR
is not removed from the optimizer tree.
Doing the Math
Now, given that selectivities A
and B
are both equal to 0.603061 in this contrived example, our expanded expression becomes:
not ((not 0.603061) and (not 0.603061))
We also know:
(not X) = (1 – X)
Expanding the inner not
s means the computation becomes:
not ((1 - 0.603061) and (1 - 0.603061))
Now we can perform the numeric math inside the parentheses to get:
not (0.396939 and 0.396939)
Now we use exponential backoff to compute the AND
selectivity:
not (0.396939 * SQRT(0.396939))
And replace the outer not
just as we did before:
1 - (0.396939 * SQRT(0.396939))
Now we have just numbers. The final selectivity is 0.749916.
The Result
The final step is to multiply this selectivity by the cardinality of the base table to get our overall estimate:
0.749916 * 113443 = 85073
(rounded up)
The 2014 execution plan for this query using the new cardinality estimator is:
The cardinality estimate is 85,073 rows, just as we predicted.
The actual number of rows returned by this query is 68,413 rows (because both predicates qualify exactly the same rows). Nothing is perfect, and cardinality estimation doubly so.
Thanks for reading.