Monday, 27 January 2014

StarJoinInfo in SQL Server Execution Plans

Introduction

From time to time, you might notice that one or more joins in an execution plan is annotated with a StarJoinInfo structure. The official showplan schema has the following to say about this plan element:

<xsd:complexType name="StarJoinInfoType">
<xsd:annotation>
<xsd:documentation>Additional information about Star Join structure.</xsd:documentation>
</xsd:annotation>
<xsd:attribute name="Root" type="xsd:boolean" use="optional"/>
<xsd:attribute name="OperationType" use="required">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:enumeration value="Fetch"/>
<xsd:enumeration value="Index Intersection"/>
<xsd:enumeration value="Index Filter"/>
<xsd:enumeration value="Index Lookup"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:attribute>
</xsd:complexType>

The inline documentation (“Additional information about Star Join structure”) is not all that enlightening, though the other details are quite intriguing. I will cover those in detail.

If you consult your favourite search engine for more information using terms like “SQL Server star join optimization”, you are likely to see results describing optimized bitmap filters. This is a separate Enterprise Edition feature introduced in SQL Server 2008 and not related to the StarJoinInfo structure at all.

Selective Star Queries

The presence of StarJoinInfo indicates that SQL Server applied one of a set of optimizations targeted at selective star schema queries. These optimizations are available from SQL Server 2005 in all editions (not just Enterprise). Note that selective here refers to the number of rows fetched from the fact table. The combination of dimensional predicates in a query may still be selective even where the individual predicates qualify a large number of rows.

Index intersection

The query optimizer may consider combining multiple nonclustered indexes where a suitable single index does not exist, as the following AdventureWorks query demonstrates:

SELECT
    COUNT_BIG(*)
FROM Sales.SalesOrderHeader
WHERE
    SalesPersonID = 276
    AND CustomerID = 29522;

The optimizer determines that combining two nonclustered indexes (one on SalesPersonID and the other on CustomerID) is the cheapest way to satisfy this query (there is no single index covering both columns):

Index Intersection

Each index seek returns the clustered index key for rows that satisfy the predicate. The join matches the returned keys to ensure that only rows that match both predicates are passed on.

If the table were a heap, each seek would return heap row identifiers (RIDs) instead of clustered index keys, but the overall strategy is the same: Find row identifiers for each predicate then match them up.

Manual Index Intersection

The same idea can be extended to queries that select rows from a fact table using predicates applied to dimension tables.

To see how this works, consider the following query (using the Contoso BI sample database) to find the total sales amount for MP3 players sold in Contoso stores with exactly 50 employees:

SELECT
    SUM(FS.SalesAmount)
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
    ON DP.ProductKey = FS.ProductKey
JOIN dbo.DimStore AS DS
    ON DS.StoreKey = FS.StoreKey
WHERE
    DS.EmployeeCount = 50
    AND DP.ProductName LIKE N'%MP3%';

For comparison with later efforts, this (very selective) query produces a query plan like the following:

Default query plan

That plan has an estimated cost of just over 15.6 units. It features parallel execution with a full scan of the fact table (albeit with a bitmap filter applied).

The fact tables in this sample database do not include nonclustered indexes on the fact table foreign keys by default, so we need to add a couple:

CREATE INDEX ix_ProductKey ON dbo.FactSales (ProductKey);
CREATE INDEX ix_StoreKey ON dbo.FactSales (StoreKey);

With these indexes in place, we can start to see how index intersection can be used to improve efficiency.

The first step is to find fact table row identifiers for each separate predicate. The following queries apply a single dimension predicate then join back to the fact table to find row identifiers (fact table clustered index keys):

-- Product dimension predicate
SELECT
    FS.SalesKey
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
    ON DP.ProductKey = FS.ProductKey
WHERE
    DP.ProductName LIKE N'%MP3%';

-- Store dimension predicate
SELECT
    FS.SalesKey
FROM dbo.FactSales AS FS
JOIN dbo.DimStore AS DS
    ON DS.StoreKey = FS.StoreKey
WHERE
    DS.EmployeeCount = 50;

The query plans show a scan of the small dimension table followed by lookups using the fact table nonclustered index to find row identifiers (remember nonclustered indexes always include the base table clustering key or heap RID):

Dimension Lookup Plans

The intersection of these two sets of fact table clustered index keys identifies the rows that should be returned by the original query. Once we have these row identifiers, we just need to look up the SalesAmount in each fact table row and compute the sum.

Intersection query

Putting all that together in a query gives the following:

SELECT SUM(FS.SalesAmount)
FROM
(
    -- Product dimension predicate
    SELECT
        FS.SalesKey
    FROM dbo.FactSales AS FS
    JOIN dbo.DimProduct AS DP
        ON DP.ProductKey = FS.ProductKey
    WHERE
        DP.ProductName LIKE N'%MP3%'

    INTERSECT

    -- Store dimension predicate
    SELECT
        FS.SalesKey
    FROM dbo.FactSales AS FS
    JOIN dbo.DimStore AS DS
        ON DS.StoreKey = FS.StoreKey
    WHERE
        DS.EmployeeCount = 50
) AS Keys
JOIN dbo.FactSales AS FS WITH (FORCESEEK)
    ON FS.SalesKey = Keys.SalesKey
OPTION (MAXDOP 1);

The FORCESEEK hint is there to ensure we get point lookups to the fact table. Without this, the optimizer chooses to scan the fact table, which is exactly what we are looking to avoid. The MAXDOP 1 hint just helps keep the final plan to a fairly reasonable size for display purposes:

Manual Index Intersection

The component parts of the manual index intersection plan are quite easy to identify. The two fact table nonclustered index lookups on the right hand side produce the two sets of fact table row identifiers. The hash join finds the intersection of these two sets. The clustered index seek into the fact table finds the SalesAmount for these row identifiers. Finally, the Stream Aggregate computes the total amount.

This plan performs relatively few lookups into the fact table nonclustered and clustered indexes. If the query is selective enough, this might well be a cheaper execution strategy than scanning the fact table completely. The Contoso BI sample database is relatively small, with only 3.4 million rows in the sales fact table. For larger fact tables, the difference between a full scan and a few hundred lookups could be very significant. Unfortunately, the manual rewrite introduces some serious cardinality errors, resulting in a plan with an estimated cost of 46.5 units.

Automatic Intersection with Lookups

Luckily, we do not have to decide if the query we are writing is selective enough to justify this manual rewrite. The star join optimizations for selective queries mean the query optimizer can explore this option for us, using the more user-friendly original query syntax:

SELECT
    SUM(FS.SalesAmount)
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
    ON DP.ProductKey = FS.ProductKey
JOIN dbo.DimStore AS DS
    ON DS.StoreKey = FS.StoreKey
WHERE
    DS.EmployeeCount = 50
    AND DP.ProductName LIKE N'%MP3%';

The optimizer produces the following execution plan with an estimated cost of 1.64 units:

Optimizer Plan

The differences between this plan and the manual version are:

The index intersection is an inner join instead of a semi join
The clustered index lookup is shown as a Key Lookup instead of a Clustered Index Seek.

Remember, if the fact table were a heap, the Key Lookup would be an RID Lookup.

StarJoinInfo Properties

The joins in this plan all have a StarJoinInfo structure. To see it, click on a join iterator and look in the SSMS Properties window. Click on the arrow to the left of the StarJoinInfo element to expand the node.

The nonclustered fact table joins on the right of the plan are Index Lookups built by the optimizer:

Index Lookup

The hash join has a StarJoinInfo structure showing it is performing an Index Intersection (again, manufactured by the optimizer):

Index Intersection

The StarJoinInfo for the leftmost Nested Loops join shows it was generated to fetch fact table rows by row identifier. It is at the root of the optimizer-generated star join subtree:

Root and Fetch

Cartesian Product with Lookup

The index intersection plans considered as part of the star join optimizations are useful for for selective fact table queries where single-column nonclustered indexes exist on fact table foreign keys (a common design practice).

It sometimes also makes sense to create multi-column indexes on fact table foreign keys, for frequently-queried combinations. The built-in selective star query optimizations contain a rewrite for this scenario too. To see how this works, add the following multi-column index to the fact table:

CREATE INDEX ix_ProductKey_StoreKey 
ON dbo.FactSales (ProductKey, StoreKey);

Compile the test query again:

SELECT
    SUM(FS.SalesAmount)
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
    ON DP.ProductKey = FS.ProductKey
JOIN dbo.DimStore AS DS
    ON DS.StoreKey = FS.StoreKey
WHERE
    DS.EmployeeCount = 50
    AND DP.ProductName LIKE N'%MP3%';

The query plan no longer features index intersection:

Cartesian Product

The strategy chosen here is:

Apply each predicate to the dimension tables.
Take the cartesian product of the results.
Use that to seek into both keys of the multi-column index.
Performs a Key Lookup into the fact table using row identifiers (as seen previously).

This plan is particularly interesting because it combines three features that are often regarded as bad things (full scan, cartesian product, and key lookup) in a performance optimization. This is a valid strategy when the product of the two dimensions is expected to be very small.

There is no StarJoinInfo for the cartesian product, but the other joins do have that information:

Index Lookup on Cartesian Product

Index Filter

Referring back to the showplan schema, there is one other StarJoinInfo operation we need to cover:

<xsd:complexType name="StarJoinInfoType">
<xsd:annotation>
<xsd:documentation>Additional information about Star Join structure.</xsd:documentation>
</xsd:annotation>
<xsd:attribute name="Root" type="xsd:boolean" use="optional"/>
<xsd:attribute name="OperationType" use="required">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:enumeration value="Fetch"/>
<xsd:enumeration value="Index Intersection"/>
<xsd:enumeration value="Index Filter"/>
<xsd:enumeration value="Index Lookup"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:attribute>
</xsd:complexType>

The Index Filter value is seen with joins that are considered selective enough to be worth performing before the fact table fetch. Joins that are not selective enough will be performed after the fetch and will not have a StarJoinInfo structure.

To see an Index Filter using our test query, we need to add a third join table to the mix, remove the nonclustered fact table indexes created so far, and add a new one:

CREATE INDEX ix_ProductKey_StoreKey_PromotionKey
ON dbo.FactSales (ProductKey, StoreKey, PromotionKey);

SELECT
    SUM(FS.SalesAmount)
FROM dbo.FactSales AS FS
JOIN dbo.DimProduct AS DP
    ON DP.ProductKey = FS.ProductKey
JOIN dbo.DimStore AS DS
    ON DS.StoreKey = FS.StoreKey
JOIN dbo.DimPromotion AS DPR 
    ON DPR.PromotionKey = FS.PromotionKey
WHERE
    DS.EmployeeCount = 50
    AND DP.ProductName LIKE N'%MP3%'
    AND DPR.DiscountPercent <= 0.1;

The query plan is now:

Index Filter

Heap Index Intersection

For completeness, here is a script to create a heap copy of the fact table with the two nonclustered indexes needed to enable the index intersection optimizer rewrite:

SELECT * INTO FS FROM dbo.FactSales;

CREATE INDEX i1 ON dbo.FS (ProductKey);
CREATE INDEX i2 ON dbo.FS (StoreKey);

SELECT
    SUM(FS.SalesAmount)
FROM FS AS FS
JOIN dbo.DimProduct AS DP
    ON DP.ProductKey = FS.ProductKey
JOIN dbo.DimStore AS DS
    ON DS.StoreKey = FS.StoreKey
WHERE
    DS.EmployeeCount <= 10
    AND DP.ProductName LIKE N'%MP3%';

The execution plan for this query has the same features as before, but the index intersection is performed using RIDs instead of fact table clustered index keys and the final fetch is an RID Lookup:

RID Index Intersection

Summary

The optimizer rewrites shown here are targeted at queries that return a relatively small number of rows from a large fact table. These rewrites have been available in all editions of SQL Server since 2005.

Although intended to speed up selective star (and snowflake) schema queries in data warehousing, the optimizer may apply these techniques wherever it detects a suitable set of tables and joins. The heuristics used to detect star queries are quite broad, so you may encounter plan shapes with StarJoinInfo structures in just about any type of database. Any table of a reasonable size (say 100 pages or more) with references to smaller (dimension-like) tables is a potential candidate for these optimizations (note that explicit foreign keys are not required).

For those of you that enjoy such things, the optimizer rule responsible for generating selective star join patterns from a logical n-table join is called StarJoinToIdxStrategy (star join to index strategy).

Thanks for reading.