Friday, 23 August 2019

Batch Mode Bitmaps in SQL Server

Background

In traditional row-mode execution plans, SQL Server may introduce a Bitmap operator as part of performing early semi join reduction before a parallel hash or merge join. The bitmap is constructed from the build input, and used to filter rows on the probe input before they reach the join. I have written about row-mode bitmaps before and they are also covered in the documentation.

This article is about batch mode bitmaps, which have a very different implementation. There have been major improvements since the first appearance of the batch mode execution engine in SQL Server 2012. The details given here apply to SQL Server 2017—the most recently released version at the time of writing. Features specific to earlier versions will mentioned inline as we go along.

The Query Optimizer

The only join operator that can run in batch mode is hash join. The query optimizer decides whether a batch mode hash join (serial or parallel) should have a bitmap or not. The optimizer assesses the potential usefulness of a bitmap by computing the selectivity of a hypothetical semi join of the hash join inputs on the join key(s).

A semi join makes sense, because the function of bitmap filtering is to remove rows from the probe side that do not exist on the build side. If the estimated selectivity of the semi join is 0.75 or less, the optimizer considers it worthwhile using a batch mode bitmap filter. In other words, a bitmap is specified if the semi join calculation indicates that 25% or more of the probe-side rows would be eliminated by the bitmap.

The exact semi join selectivity is only used to determine whether the hash join should have a bitmap or not. The probe side selectivity estimate (visible as an effect on cardinality estimates) is modified to reflect the fact that bitmaps are not generally perfect in removing non-qualified rows. This is particularly the case when the bitmap is implemented using a Bloom filter, which can generate false positives (probe-side rows that pass the filter, but nevertheless do not join with the build side).

Selectivity rounding

The optimizer takes account of this uncertainty by rounding the semi join selectivity to a power of ten. It does this by taking the base-10 logarithm before rounding. For example, a semi join selectivity of 0.316 gives a log₁₀ value fractionally below -0.5, which is rounded down to -1. The resulting probe-side selectivity is 10^-1 = 0.1. By contrast, a semi join selectivity of 0.317 gives a log₁₀ value just above -0.5, which is rounded to zero. The probe-side selectivity is therefore 10⁰ = 1 (all rows pass). Possible probe-side bitmap selectivities resulting from this series of computations are 1, 0.1, 0.01, 0.001 and so on. Note that a bitmap may still be used when the estimated selectivity is rounded up to 1 (all rows pass the predicate).

Operators and cardinality estimates

There is no separate Bitmap operator for a batch mode hash join. Instead, the hash join operator either has a Bitmap Creator property (set to true), or the property is missing (not set to false). This difference between row mode and batch mode execution makes it less easy to see if a bitmap is being created, but it does more accurately reflect the underlying process: In row-mode execution, the Bitmap operator populates the bitmap as rows flow through it, one at a time, before reaching the hash join build input. In other words, the bitmap and hash table are built concurrently under row mode execution. In batch mode, the hash table is fully built before the bitmap is created (more on this shortly).

The query optimizer does not make cost-based choices about the position of a batch mode bitmap filter on the probe side of the hash join. It simply assumes that the selectivity of the bitmap will apply to all child operators on the probe side. In reality, the bitmap is only pushed down the probe side once a single final execution plan has been selected by the optimizer. If the bitmap cannot be pushed all the way down to a leaf operator, cardinality estimates will look a bit strange. This is a trade-off that might be improved in future.

The Execution Engine

While the optimizer decides whether a hash join batch mode bitmap should be used or not, the batch mode execution engine decides the type of bitmap to use at runtime. This decision is informed by statistical information gathered about the build-side join keys during the hash table build. As previously mentioned, in contrast to row-mode execution, batch-mode hash joins completely build the hash table first, before a separate pass over the hash table is performed to build the bitmap.

Simple Bitmaps

There are two main types of batch mode hash join bitmap. A simple bitmap contains one bit for each of a contiguous range of build-side values. For example, a one-byte simple bitmap is capable of indicating whether eight contiguous build-side values are present or not. The values 0 to 7 inclusive can be encoded into the bitmap by setting the corresponding bit. A value of 5 would set bit 5, which has the positional value of 2⁵. We could encode the set {1, 5, 7} as 2¹ + 2⁵ + 2⁷ = 2 + 32 + 128 = 162 = 10100010₂. A range of values that does not start at zero can be encoded the same way by offsetting all values by the minimum value present in the set (which we know from the hash table statistics).

A simple bitmap has the advantage of storing exact information about the build-side values actually present, at the cost of requiring memory proportional to the range of values present.

Complex Bitmaps

The second type of bitmap is a Bloom filter. This sets one or more bits in the bitmap structure according to the result of applying one or more hash functions to each build-side value. We test for matches by applying the same hash functions to each probe-side value. If any of the hash functions identify a bit that is not set, we can be sure the current probe value does not appear on the build side.

Since a Bloom filter is a probabilistic structure, we might not filter out some probe values that do not have a build-side match (false positive), but we will never filter out a value that is present (false negative). In other words, a Bloom filter tells us either “maybe a match” or “definitely not a match”. The rate of false positives can be reduced either by using more bits per key (a larger bitmap) or more hash functions. To be clear, a Bloom filter may produce exact filtering, but it also might not.

Bloom filters present a space- and time-efficient alternative when a simple bitmap is infeasible due to space requirements. The SQL Server batch mode implementation of a Bloom filter is optimized for modern CPU cache architectures and is known internally as a complex bitmap. Complex bitmaps have supported multiple join columns and all batch mode data types since SQL Server 2014.

Bitmap Choice

Whether SQL Server chooses a simple or complex (Bloom filter) bitmap depends on the range of build-side values (from hash table statistics). There is an important caveat to the word “range” there, which will be explained in the next section. Meanwhile, this is how the execution engine chooses the type of bitmap:

A small simple bitmap is used when the range of values is 2²³ (8,388,608) or less. This is the number of bits in 1MB.
When the range of values is more than 2²³ but less than or equal to 2²⁶ (8MB), the execution engine chooses between a large simple bitmap and a complex bitmap by calculating the required memory for each option and choosing the smaller one. A large simple bitmap can be up to 8MB in size. The size of a complex bitmap depends on the number of bits per key needed to adequately represent the data. More distinct values normally means a bigger complex bitmap.
If the range of values is beyond 2²⁶ bits, a complex bitmap is used.

About the range of values

The range of values referred to above is based on the internal binary representation of the data. For example, the smallint values 128 and 255 might be represented as 0x0080 and 0x00FF, giving a range of 0x7F (127) binary values. This range would work well with a small simple bitmap.

On the other hand, the datetime values “1900-01-01” and “1900-01-02” might be represented in 8 bytes as 0x0000000100000000 and 0x0000000200000000 (four bytes for days and four bytes for ticks after midnight). This segmented representation gives a binary range of 0x0100000000 (4,294,967,296) for those two example values, which is far too large to fit in a simple bitmap (even a large one). Data types with complex binary representations (e.g. byte-swapping) typically will not work well with simple bitmaps.

A further complication is that batch mode data is normalized—converted to a binary layout that works well with vectorized instructions—and is always 64 bits in size. Values that cannot fit in 64 bits are stored “off row” in a dictionary, with an in-row 64-bit pointer. This binary layout is not the same as reported by a T-SQL conversion to a binary type by the way.

Nevertheless, the normalized layout is similar enough for integer types (e.g. integer and bigint) that simple bitmaps are still useful for ranges of these data types. Other types with an integer-like binary representation (e.g. money and date) are also suitable.

One more example: A set of integer values in the range 1 to 8,388,608 will just fit in a 1MB small simple bitmap, using one bit per possible integer value in the range. The range may have a fixed offset, so an integer range of 10,000,000 to 18,388,607 would also fit (with an offset of ten million). Note that the number of values present does not matter, just the range. Two values of 0 and 8,388,607 will fill the small simple bitmap range just as well as a set of all possible values between those two extremes.

Range Tables

When the batch mode execution engine decides to build a large simple bitmap or a complex bitmap for a hash join, it also builds a range table. It does not build a range table for small simple bitmaps.

The range table is a fixed-size 128KB structure made up of 8,192 pairs of 8-byte values specifying a (low, high) range of values present in the hash table. A range table can be built on any data type compatible with batch mode execution. During the scan of the finished hash table, each batch of data is used to populate the bitmap and the range table.

For each value in the batch, a hash function is used to locate a bucket (pair of range values) in the range table. If the bucket is currently unused, the value is stored in both low and high 8-byte values. If the bucket is already in use, the next bucket is filled instead (and so on, until an empty bucket is located).

If the range table becomes one-third full (2,730 of 8,192 buckets used), the used entries are copied out, the bucket range is doubled, then the saved values are reinserted in the same way as before (though the hash function takes account of the new bucket size). Any overlapping buckets are merged, and the process of doubling continues until the number of used buckets falls below 2,730. For example, two buckets that initially contain the ‘ranges’ (1-1) and (2-2) can merge into a single range bucket of (1-2) after the first range doubling.

Once all batches of hash table data have been processed into the range table, a final bucket merge is performed, then an in-place quicksort puts the buckets in value order. This allows for a binary search to locate a range bucket given a particular value of interest.

The net result of all this activity is to produce a set of up to 2,730 distinct ranges describing the data in the hash table (in addition to the large simple or complex bitmap).

Using the Range Table

The range table is used when a hash join bitmap filter is pushed down into a probe-side columnstore scan operator. Each columnstore segment has catalogue metadata about the minimum and maximum data values present in the segment. The execution engine can use this information to binary-search the bitmap range table for a match. If no match is found, the execution engine can skip the segment completely.

This runtime optimization occurs for read-ahead as well, so the engine can even avoid reading segments into memory that it knows will be eliminated by the range table filter. Any segments not eliminated by the range table are still filtered using the bitmap. This combination results in the maximum number of rows being eliminated as early as possible.

Although a range table is not built for a small simple bitmap, that structure can also be used to achieve segment elimination because the range of values is known (including any offset). It is small enough to make it not worthwhile partitioning it into sub-ranges using a range table.

When the bitmap is not pushed down into a columnstore scan, it can still be evaluated as a regular batch mode filter to achieve semi join reduction before the hash join. This is much less efficient than segment elimination or filtering during the columnstore scan, but it is still better than filtering at the hash join itself.

Bitmaps and Compressed Data

Applying a hash join batch mode bitmap to columnstore data as part of the scan can produce very good performance, but it does require impure segment data to be decompressed before the bitmap can be applied. This decompression can be performed efficiently using SIMD instructions, but it is still extra work.

SQL Server 2016 introduced the ability to create a bitmap for general predicates on dictionary-encoded segment data. The predicate is evaluated against the dictionary entries to produce a new small bitmap with each set bit indicating a dictionary entry that satisfies the predicate. Applying this bitmap to the bit-packed data can be extremely fast, especially if the bitmap fits in a single SIMD register. SQL Server can still use SIMD if the bitmap does not fit, but gathering bits from an in-memory bitmap using AVX2 is a less efficient than the in-register case.

This 2016 optimization can be applied to any predicate pushed into a columnstore scan, including a bitmap ‘predicate’ created by an upstream hash join. To be clear, SQL Server takes the hash join bitmap and creates a new (much smaller) bitmap using the dictionary entries. This new bitmap is applied to the segment data before decompression. Since the dictionary references duplicated data by definition, we also save on the number of predicate evaluations this way.

This optimization can be monitored with the extended event column_store_expression_filter_bitmap_set. When a dictionary bitmap is used, the event member filter_on_compressed_data_type member will be populated. Usually this will contain the value RAWBITMAP.

A further optimization exists to convert the compressed dictionary bitmap to a comparison filter if the dictionary bitmap represents a single contiguous range of values. In that case you will see something other than RAWBITMAP (e.g. LTGT for a less than/greater than comparison).

Extended Events and trace flags

The general ability to compile pushed-down filters (including bitmaps generated by a batch mode hash join) on a columnstore scan into a bitmap can be disabled with session-level trace flag 9361. The specific compressed-data bitmap optimization can be disabled with session-level trace flag 9362. Conversion of a dictionary bitmap with a single contiguous range to a comparison filter can be disabled with trace flag 9363. There are sadly no retail-build trace flags that report informational messages about batch mode bitmaps or pushed-down filter compilation.

There are a few extended events that produce information about hash join batch mode bitmaps. The most useful ones are:

query_execution_column_store_segment_scan_started
query_execution_column_store_segment_scan_finished
column_store_expression_filter_bitmap_set
column_store_segment_eliminate

When a hash join batch mode bitmap is pushed down into a columnstore scan, the “started” event reports BITMAP_SIMPLE or BITMAP_COMPLEX as the filter_type. It does not distinguish between small and large simple bitmaps, nor does it report anything about the range table. The extended event metadata does contain other values for column_store_filter_type that include BITMAP_SIMPLE_LARGE among other things, but the extended event does not currently produce this output when a large simple bitmap is used. Perhaps this will be corrected in a future release.

Global trace flag 646 can be used to report information about segment elimination enabled by the range table (or a small simple bitmap). It reports similar information to the segment eliminated extended event. All trace flags mentioned in this section are undocumented and unsupported.

Final Thoughts

Batch mode bitmaps can be extremely effective when the right data types (max 64 bits and integer-like) are used and the bitmap can be pushed down to a columnstore scan, especially if the segment data uses RLE encoding (pure storage) or the bitmap can be compiled into another bitmap on compressed bit-packed (impure) data.

It might be nice if execution plans reported more detailed information about hash join bitmaps—at least to say what type of bitmap was created. As it is, we only have the Bitmap Creator property, and some extended events to work with. This makes detailed plan analysis a bit harder than it ought to be, given the huge performance gains that can be realized by taking advantage of all the clever optimizations built into the execution engine for columnstore data and batch mode hash joins.

Thanks for reading.