Pulling Group By Above a Join
One of the transformations available to the SQL Server query optimizer is pulling a logical Group By (and any associated aggregates) above a Join.
Visually, this means transforming a tree of logical operations from:
…to this:
The above diagrams are logical representations. They need to be implemented as physical operators to appear in an execution plan. The options are:
- Group By
- Hash Match Aggregate
- Stream Aggregate
- Distinct Sort
- Join
- Nested Loops Join
- Nested Loops Apply
- Hash Match Join
- Merge Join
When the optimizer moves a Group By above a Join it has to preserve the semantics. The new sequence of operations must be guaranteed to return the same results as the original in all possible circumstances.
One cannot just pick up a Group By and arbitrarily move it around the query tree without risking incorrect results.
Conditions
The optimizer can move a Group By above a Join using the rule GbAggAfterJoin
. The only conditions that must be met to consider this transformation safely are:
- The other input to the join must have a key.
- The join predicate must not use any aggregates computed by the moving Group By.
When the optimizer pulls a Group By above a Join, it will choose between the alternatives using estimated costs.
Demo
Using the AdventureWorks OLTP sample database, the following query joins the Product
table to a derived table containing a Group By aggregate:
SELECT P.[Name], J.*
FROM Production.Product AS P
JOIN
(
SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
GROUP BY TH.ProductID, TH.Quantity
) AS J
ON J.ProductID = P.ProductID
WHERE P.[Name] LIKE N'A%'
OPTION (FORCE ORDER);
The FORCE ORDER
hint has two effects. First, it ensures the tables are physically joined in the order they appear in the query text. Second, it prevents the optimizer from reordering aggregates.
The resulting execution plan has an estimated cost of 3.64687 units:
The Group By is below the join, implemented in this case with a Hash Match Aggregate physical operator. The FORCE ORDER
hint prevented the optimizer from considering moving the Group By above the Join.
Without the hint
Removing the FORCE ORDER
hint:
SELECT P.[Name], J.*
FROM Production.Product AS P
JOIN
(
SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
GROUP BY TH.ProductID, TH.Quantity
) AS J
ON J.ProductID = P.ProductID
WHERE P.[Name] LIKE N'A%';
…gives a plan with an estimated cost of 1.3409 units:
Analysis
The optimizer has pulled the Group By above the Join. Aggregating after the join is assessed to be cheaper because the join eliminates many rows. The join is slightly more expensive than before because it processes more rows on the probe side, but this difference is overwhelmed by the cheaper aggregate.
The optimizer chose to implement the logical Group By as a Stream Aggregate, which requires input sorted on the grouping keys. A Sort operator is introduced to meet that requirement. The combination of Sort plus Stream Aggregate is estimated to be slightly cheaper than a Hash Match Aggregate.
The optimizer was able to consider pulling the Group By above the Join because the Product
table has a key, and the join predicate does not use any aggregates computed by the moving Group By.
Grouping Keys
The grouping keys used by the Stream Aggregate in the prior example are ProductID
and Quantity
. This matches the GROUP BY
clause of the derived table in this case, but that is because I chose a deliberately simple example to get us started.
In general, pulling a Group By above a join requires grouping by a key from the other join input.
To see this in action, let’s use a temporary table to join from instead of the Product
table. We will make the Name
column the only key using a UNIQUE
constraint:
CREATE TABLE #P
(
ProductID integer NOT NULL,
[Name] nvarchar(50)
COLLATE DATABASE_DEFAULT
NOT NULL
UNIQUE
);
GO
INSERT #P
(
ProductID,
[Name]
)
SELECT
P.ProductID,
P.[Name]
FROM Production.Product AS P;
Our query is now:
SELECT P.[Name], J.*
FROM #P AS P
JOIN
(
SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
GROUP BY TH.ProductID, TH.Quantity
) AS J
ON J.ProductID = P.ProductID
WHERE P.[Name] LIKE N'A%';
The execution plan is:
The Stream Aggregate now computes its COUNT(*)
aggregate grouped by (Name, Quantity)
instead of (ProductID, Quantity)
as in the query text and previous plans.
This has the same semantics as the original query because the ProductID
column is functionally determined by the Name
key.
Getting every nuance of this sort of relational transformation correct can be tricky. It is very handy that the optimizer team put the effort in so we do not have to explore these tricky rewrites manually (e.g. by changing the query text). If nothing else, it would be extremely tedious to write all the different query forms out by hand just to see which one performed better in practice. Never mind choosing a different version depending on current statistics and the number of changes to the table.
No Key
What happens if the (non-grouping) table does not have a key? Will the optimizer still be able to pull the Group By above the inner join?
Let’s find out by recreating and repopulating the temporary table without any keys:
DROP TABLE IF EXISTS #P;
GO
CREATE TABLE #P
(
ProductID integer NOT NULL,
[Name] nvarchar(50)
COLLATE DATABASE_DEFAULT
NOT NULL
-- No longer UNIQUE!
);
GO
INSERT #P
(
ProductID,
[Name]
)
SELECT
P.ProductID,
P.[Name]
FROM Production.Product AS P;
Our query:
SELECT P.[Name], J.*
FROM #P AS P
JOIN
(
SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
GROUP BY TH.ProductID, TH.Quantity
) AS J
ON J.ProductID = P.ProductID
WHERE P.[Name] LIKE N'A%';
…produces a very similar plan:
With no key column(s) available from the table, the optimizer has chosen the only unique identifier available — the heap row locator (RID or “bookmark”). This is projected from the Table Scan and labelled Bmk1000
in this case. A bookmark must uniquely identify each row, and is therefore a suitable key to group by.
No Table
What if there isn’t even a table? What will SQL Server do for a key then?
The following rewrite of the original query replaces the Product
table and LIKE N'A%'
predicate with a hard-coded list of values:
SELECT P.[Name], J.*
FROM
(
VALUES
(1, N'Adjustable Race'),
(879, N'All-Purpose Bike Stand'),
(712, N'AWC Logo Cap')
)
AS P (ProductID, [Name])
JOIN
(
SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
GROUP BY TH.ProductID, TH.Quantity
) AS J
ON J.ProductID = P.ProductID;
The execution plan has a familiar shape:
The Constant Scan is an in-memory “table” of values, but it has no row locator since it is not a physical table.
The optimizer found a key by inspecting the VALUES
clause! The two columns are projected as Union1006
and Union1007
. The optimizer can see that the literal values we entered do not contain any duplicates, so it could choose either column (ProductID
or Name
) as a key. Neat!
No Table and No Key
Let’s see what happens if we introduce a duplicate in the VALUES
clause so there is no longer a key of any kind:
SELECT P.[Name], J.*
FROM
(
VALUES
(1, N'Adjustable Race'),
(879, N'All-Purpose Bike Stand'),
(879, N'All-Purpose Bike Stand'), -- DUPLICATE!
(712, N'AWC Logo Cap')
)
AS P (ProductID, [Name])
JOIN
(
SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
GROUP BY TH.ProductID, TH.Quantity
) AS J
ON J.ProductID = P.ProductID;
The plan is:
With both columns in the VALUES
clause containing duplicates, the optimizer has one final card to play. It resorts to manufacturing a key as part of the GbAggAfterJoin
exploration.
The manufactured key is named I8Rank1013
in this instance. It refers to the result of a ROW_NUMBER
calculation performed by the Segment and Sequence Project operators.
Since we deprived it of any other key source, the optimizer uniquely numbered the rows defined in the VALUES
clause. How cool is that?
Thanks for reading.