StarJoinInfo in SQL Server Execution Plans
Introduction
From time to time, you might notice that one or more joins in an execution plan is annotated with a StarJoinInfo
structure. The official showplan schema has the following to say about this plan element:
<xsd:complexType name="StarJoinInfoType">
<xsd:annotation>
<xsd:documentation>Additional information about Star Join structure.</xsd:documentation>
</xsd:annotation>
<xsd:attribute name="Root" type="xsd:boolean" use="optional"/>
<xsd:attribute name="OperationType" use="required">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:enumeration value="Fetch"/>
<xsd:enumeration value="Index Intersection"/>
<xsd:enumeration value="Index Filter"/>
<xsd:enumeration value="Index Lookup"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:attribute>
</xsd:complexType>
The inline documentation (“Additional information about Star Join structure”) is not all that enlightening, though the other details are quite intriguing. I will cover those in detail.
If you consult your favourite search engine for more information using terms like “SQL Server star join optimization”, you are likely to see results describing optimized bitmap filters. This is a separate Enterprise Edition feature introduced in SQL Server 2008 and not related to the StarJoinInfo
structure at all.
Selective Star Queries
The presence of StarJoinInfo
indicates that SQL Server applied one of a set of optimizations targeted at selective star schema queries. These optimizations are available from SQL Server 2005 in all editions (not just Enterprise). Note that selective here refers to the number of rows fetched from the fact table. The combination of dimensional predicates in a query may still be selective even where the individual predicates qualify a large number of rows.
Index intersection
The query optimizer may consider combining multiple nonclustered indexes where a suitable single index does not exist, as the following AdventureWorks query demonstrates:
SELECT
COUNT_BIG(*)
FROM Sales.SalesOrderHeader
WHERE
SalesPersonID = 276
AND CustomerID = 29522;
The optimizer determines that combining two nonclustered indexes (one on SalesPersonID
and the other on CustomerID
) is the cheapest way to satisfy this query (there is no single index covering both columns):
Each index seek returns the clustered index key for rows that satisfy the predicate. The join matches the returned keys to ensure that only rows that match both predicates are passed on.
If the table were a heap, each seek would return heap row identifiers (RIDs) instead of clustered index keys, but the overall strategy is the same: Find row identifiers for each predicate then match them up.
Manual Index Intersection
The same idea can be extended to queries that select rows from a fact table using predicates applied to dimension tables.
To see how this works, consider the following query (using the Contoso BI sample database) to find the total sales amount for MP3 players sold in Contoso stores with exactly 50 employees:
SELECT
SUM(FS.SalesAmount)
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
ON DP.ProductKey = FS.ProductKey
JOIN dbo.DimStore AS DS
ON DS.StoreKey = FS.StoreKey
WHERE
DS.EmployeeCount = 50
AND DP.ProductName LIKE N'%MP3%';
For comparison with later efforts, this (very selective) query produces a query plan like the following:
That plan has an estimated cost of just over 15.6 units. It features parallel execution with a full scan of the fact table (albeit with a bitmap filter applied).
The fact tables in this sample database do not include nonclustered indexes on the fact table foreign keys by default, so we need to add a couple:
CREATE INDEX ix_ProductKey ON dbo.FactSales (ProductKey);
CREATE INDEX ix_StoreKey ON dbo.FactSales (StoreKey);
With these indexes in place, we can start to see how index intersection can be used to improve efficiency.
The first step is to find fact table row identifiers for each separate predicate. The following queries apply a single dimension predicate then join back to the fact table to find row identifiers (fact table clustered index keys):
-- Product dimension predicate
SELECT
FS.SalesKey
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
ON DP.ProductKey = FS.ProductKey
WHERE
DP.ProductName LIKE N'%MP3%';
-- Store dimension predicate
SELECT
FS.SalesKey
FROM dbo.FactSales AS FS
JOIN dbo.DimStore AS DS
ON DS.StoreKey = FS.StoreKey
WHERE
DS.EmployeeCount = 50;
The query plans show a scan of the small dimension table followed by lookups using the fact table nonclustered index to find row identifiers (remember nonclustered indexes always include the base table clustering key or heap RID):
The intersection of these two sets of fact table clustered index keys identifies the rows that should be returned by the original query. Once we have these row identifiers, we just need to look up the SalesAmount
in each fact table row and compute the sum.
Intersection query
Putting all that together in a query gives the following:
SELECT SUM(FS.SalesAmount)
FROM
(
-- Product dimension predicate
SELECT
FS.SalesKey
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
ON DP.ProductKey = FS.ProductKey
WHERE
DP.ProductName LIKE N'%MP3%'
INTERSECT
-- Store dimension predicate
SELECT
FS.SalesKey
FROM dbo.FactSales AS FS
JOIN dbo.DimStore AS DS
ON DS.StoreKey = FS.StoreKey
WHERE
DS.EmployeeCount = 50
) AS Keys
JOIN dbo.FactSales AS FS WITH (FORCESEEK)
ON FS.SalesKey = Keys.SalesKey
OPTION (MAXDOP 1);
The FORCESEEK
hint is there to ensure we get point lookups to the fact table. Without this, the optimizer chooses to scan the fact table, which is exactly what we are looking to avoid. The MAXDOP 1
hint just helps keep the final plan to a fairly reasonable size for display purposes:
The component parts of the manual index intersection plan are quite easy to identify. The two fact table nonclustered index lookups on the right hand side produce the two sets of fact table row identifiers. The hash join finds the intersection of these two sets. The clustered index seek into the fact table finds the SalesAmount
for these row identifiers. Finally, the Stream Aggregate computes the total amount.
This plan performs relatively few lookups into the fact table nonclustered and clustered indexes. If the query is selective enough, this might well be a cheaper execution strategy than scanning the fact table completely. The Contoso BI sample database is relatively small, with only 3.4 million rows in the sales fact table. For larger fact tables, the difference between a full scan and a few hundred lookups could be very significant. Unfortunately, the manual rewrite introduces some serious cardinality errors, resulting in a plan with an estimated cost of 46.5 units.
Automatic Intersection with Lookups
Luckily, we do not have to decide if the query we are writing is selective enough to justify this manual rewrite. The star join optimizations for selective queries mean the query optimizer can explore this option for us, using the more user-friendly original query syntax:
SELECT
SUM(FS.SalesAmount)
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
ON DP.ProductKey = FS.ProductKey
JOIN dbo.DimStore AS DS
ON DS.StoreKey = FS.StoreKey
WHERE
DS.EmployeeCount = 50
AND DP.ProductName LIKE N'%MP3%';
The optimizer produces the following execution plan with an estimated cost of 1.64 units:
The differences between this plan and the manual version are:
- The index intersection is an inner join instead of a semi join
- The clustered index lookup is shown as a Key Lookup instead of a Clustered Index Seek.
Remember, if the fact table were a heap, the Key Lookup would be an RID Lookup.
StarJoinInfo Properties
The joins in this plan all have a StarJoinInfo
structure. To see it, click on a join iterator and look in the SSMS Properties window. Click on the arrow to the left of the StarJoinInfo
element to expand the node.
The nonclustered fact table joins on the right of the plan are Index Lookups built by the optimizer:
The hash join has a StarJoinInfo
structure showing it is performing an Index Intersection (again, manufactured by the optimizer):
The StarJoinInfo
for the leftmost Nested Loops join shows it was generated to fetch fact table rows by row identifier. It is at the root of the optimizer-generated star join subtree:
Cartesian Product with Lookup
The index intersection plans considered as part of the star join optimizations are useful for for selective fact table queries where single-column nonclustered indexes exist on fact table foreign keys (a common design practice).
It sometimes also makes sense to create multi-column indexes on fact table foreign keys, for frequently-queried combinations. The built-in selective star query optimizations contain a rewrite for this scenario too. To see how this works, add the following multi-column index to the fact table:
CREATE INDEX ix_ProductKey_StoreKey
ON dbo.FactSales (ProductKey, StoreKey);
Compile the test query again:
SELECT
SUM(FS.SalesAmount)
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
ON DP.ProductKey = FS.ProductKey
JOIN dbo.DimStore AS DS
ON DS.StoreKey = FS.StoreKey
WHERE
DS.EmployeeCount = 50
AND DP.ProductName LIKE N'%MP3%';
The query plan no longer features index intersection:
The strategy chosen here is:
- Apply each predicate to the dimension tables.
- Take the cartesian product of the results.
- Use that to seek into both keys of the multi-column index.
- Performs a Key Lookup into the fact table using row identifiers (as seen previously).
This plan is particularly interesting because it combines three features that are often regarded as bad things (full scan, cartesian product, and key lookup) in a performance optimization. This is a valid strategy when the product of the two dimensions is expected to be very small.
There is no StarJoinInfo
for the cartesian product, but the other joins do have that information:
Index Filter
Referring back to the showplan schema, there is one other StarJoinInfo
operation we need to cover:
<xsd:complexType name="StarJoinInfoType">
<xsd:annotation>
<xsd:documentation>Additional information about Star Join structure.</xsd:documentation>
</xsd:annotation>
<xsd:attribute name="Root" type="xsd:boolean" use="optional"/>
<xsd:attribute name="OperationType" use="required">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:enumeration value="Fetch"/>
<xsd:enumeration value="Index Intersection"/>
<xsd:enumeration value="Index Filter"/>
<xsd:enumeration value="Index Lookup"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:attribute>
</xsd:complexType>
The Index Filter
value is seen with joins that are considered selective enough to be worth performing before the fact table fetch. Joins that are not selective enough will be performed after the fetch and will not have a StarJoinInfo
structure.
To see an Index Filter
using our test query, we need to add a third join table to the mix, remove the nonclustered fact table indexes created so far, and add a new one:
CREATE INDEX ix_ProductKey_StoreKey_PromotionKey
ON dbo.FactSales (ProductKey, StoreKey, PromotionKey);
SELECT
SUM(FS.SalesAmount)
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
ON DP.ProductKey = FS.ProductKey
JOIN dbo.DimStore AS DS
ON DS.StoreKey = FS.StoreKey
JOIN dbo.DimPromotion AS DPR
ON DPR.PromotionKey = FS.PromotionKey
WHERE
DS.EmployeeCount = 50
AND DP.ProductName LIKE N'%MP3%'
AND DPR.DiscountPercent <= 0.1;
The query plan is now:
Heap Index Intersection
For completeness, here is a script to create a heap copy of the fact table with the two nonclustered indexes needed to enable the index intersection optimizer rewrite:
SELECT * INTO FS FROM dbo.FactSales;
CREATE INDEX i1 ON dbo.FS (ProductKey);
CREATE INDEX i2 ON dbo.FS (StoreKey);
SELECT
SUM(FS.SalesAmount)
FROM FS AS FS
JOIN dbo.DimProduct AS DP
ON DP.ProductKey = FS.ProductKey
JOIN dbo.DimStore AS DS
ON DS.StoreKey = FS.StoreKey
WHERE
DS.EmployeeCount <= 10
AND DP.ProductName LIKE N'%MP3%';
The execution plan for this query has the same features as before, but the index intersection is performed using RIDs instead of fact table clustered index keys and the final fetch is an RID Lookup:
Summary
The optimizer rewrites shown here are targeted at queries that return a relatively small number of rows from a large fact table. These rewrites have been available in all editions of SQL Server since 2005.
Although intended to speed up selective star (and snowflake) schema queries in data warehousing, the optimizer may apply these techniques wherever it detects a suitable set of tables and joins. The heuristics used to detect star queries are quite broad, so you may encounter plan shapes with StarJoinInfo
structures in just about any type of database. Any table of a reasonable size (say 100 pages or more) with references to smaller (dimension-like) tables is a potential candidate for these optimizations (note that explicit foreign keys are not required).
For those of you that enjoy such things, the optimizer rule responsible for generating selective star join patterns from a logical n-table join is called StarJoinToIdxStrategy
(star join to index strategy).
Thanks for reading.