Tuesday, 20 April 2021

How Parallel Plans Start Up—Part 3

This is the third in a five-part series taking a deep dive into the way SQL Server row mode parallel plans start executing. Part 1 initialized execution context zero for the parent task, and part 2 created the query scan tree. We are now ready to start the query scan, perform some early phase processing, and start the first additional parallel tasks.

Query Scan Start Up

Recall that only the parent task exists right now, and the exchanges (parallelism operators) have only a consumer side. Still, this is enough for query execution to begin, on the parent task’s worker thread. The query processor begins execution by starting the query scan process via a call to CQueryScan::StartupQuery. A reminder of the plan:

Plan with node ids

This is the first point in the process so far that an in-flight execution plan is available (SQL Server 2016 SP1 onward) in sys.dm_exec_query_statistics_xml. There is nothing particularly interesting to see in such a plan at this point, because all the transient counters are zero, but the plan is at least available. There is no hint that parallel tasks have not been created yet, or that the exchanges lack a producer side. The plan looks ‘normal’ in all respects.

Parallel Plan Branches

Since this is a parallel plan, it will be useful to show it broken up into branches. These are shaded below, and labelled as branches A to D:

Parallel branches

Branch A is associated with the parent task, running on the worker thread provided by the session. Additional parallel workers will be started to run the additional parallel tasks contained in branches B, C, and D. Those branches are parallel, so there will be DOP additional tasks and workers in each one.

Our example query is running at DOP 2, so branch B will get two additional tasks. The same goes for branch C and branch D, giving a total of six additional tasks. Each task will run on its own worker thread in its own execution context.

Two schedulers (S₁ and S₂) are assigned to this query to run additional parallel workers. Each additional worker will run on one of those two schedulers. The parent worker may run on a different scheduler, so our DOP 2 query may use a maximum of three processor cores at any one moment in time.

To summarize, our plan will eventually have:

Branch A (parent)
- Parent task.
- Parent worker thread.
- Execution context zero.
- Any single scheduler available to the query.
Branch B (additional)
- Two additional tasks.
- An additional worker thread bound to each new task.
- Two new execution contexts, one for each new task.
- One worker thread runs on scheduler S₁. The other runs on scheduler S₂.
Branch C (additional)
- Two additional tasks.
- An additional worker thread bound to each new task.
- Two new execution contexts, one for each new task.
- One worker thread runs on scheduler S₁. The other runs on scheduler S₂.
Branch D (additional)
- Two additional tasks.
- An additional worker thread bound to each new task.
- Two new execution contexts, one for each new task.
- One worker thread runs on scheduler S₁. The other runs on scheduler S₂.

The question is how all these extra tasks, workers, and execution contexts are created, and when they start running.

Starting Sequence

The sequence in which additional tasks start executing for this particular plan is:

Branch A (parent task).
Branch C (additional parallel tasks).
Branch D (additional parallel tasks).
Branch B (additional parallel tasks).

That may not be the start-up order you were expecting.

There may be a significant delay between each of these steps, for reasons we will explore shortly. The key point at this stage is that the additional tasks, workers, and execution contexts are not all created at once, and they do not all start executing at the same time.

SQL Server could have been designed to start all the extra parallel bits all at once. That might be easy to comprehend, but it wouldn’t be very efficient in general. It would maximize the number of additional threads and other resources used by the query, and result in a great deal of unnecessary parallel waits.

With the design employed by SQL Server, parallel plans will often use fewer total worker threads than (DOP multiplied by the total number of branches). This is achieved by recognizing that some branches can run to completion before some other branch needs to start. This can allow reuse of threads within the same query, and generally reduces resource consumption overall.

Let’s now turn to the details of how our parallel plan starts up.

1. Opening Branch A

The query scan starts executing with the parent task calling Open() on the iterator at the root of the tree. This is the start of the execution sequence:

Branch A (parent task).
Branch C (additional parallel tasks).
Branch D (additional parallel tasks).
Branch B (additional parallel tasks).

We are executing this query with an ‘actual’ plan requested, so the root iterator is not the sequence project operator at node 0. Rather, it is the invisible profiling iterator that records runtime metrics in row mode plans.

The illustration below shows the query scan iterators in Branch A of the plan, with the position of invisible profiling iterators represented by the ‘spectacles’ icons.

Branch A with profilers

Execution starts with a call to open the first profiler, CQScanProfileNew::Open. This sets the open time for the child sequence project operator via the operating system’s Query Performance Counter API.

We can see this number in sys.dm_exec_query_profiles:

Opening the sequence project

The entries there may have the operator names listed, but the data comes from the profiler above the operator, not the operator itself.

As it happens, a sequence project (CQScanSeqProjectNew) does not need to do any work when opened, so it does not actually have an Open() method. The profiler above the sequence project is called, so an open time for the sequence project is recorded in the DMV.

The profiler’s Open method does not call Open on the sequence project (since it doesn’t have one). Instead it calls Open on the profiler for the next iterator in sequence. This is the segment iterator at node 1. That sets the open time for the segment, just as the prior profiler did for the sequence project:

Opening the segment

A segment iterator does have things to do when opened, so the next call is to CQScanSegmentNew::Open. Once the segment has done what it needs to, it calls the profiler for the next iterator in sequence—the consumer side of the gather streams exchange at node 2:

Opening the gather streams exchange

The next call down the query scan tree in the opening process is CQScanExchangeNew::Open, which is where things start to get more interesting.

Opening the gather streams exchange

Asking the consumer side of the exchange to open:

Opens a local (parallel nested) transaction (CXTransLocal::Open). Every process needs a containing transaction, and additional parallel tasks are no exception. They can’t share the parent (base) transaction directly, so nested transactions are used. When a parallel task needs to access the base transaction, it synchronizes on a latch, and may encounter NESTING_TRANSACTION_READONLY or NESTING_TRANSACTION_FULL waits.
Registers the current worker thread with the exchange port (CXPort::Register).
Synchronizes with other threads on the consumer side of the exchange (sqlmin!CXTransLocal::Synchronize). There are no other threads on the consumer side of a gather streams, so this is essentially a no-op on this occasion.

“Early Phases” Processing

The parent task has now reached the edge of Branch A. The next step is particular to row mode parallel plans: The parent task continues execution by calling CQScanExchangeNew::EarlyPhases on the gather streams exchange iterator at node 2. This is an additional iterator method beyond the usual Open, GetRow, and Close methods that many of you will be familiar with. EarlyPhases is only called in row mode parallel plans.

I want to be clear about something at this point: The producer side of the gather streams exchange at node 2 has not been created yet, and no additional parallel tasks have been created. We are still executing code for the parent task, using the only thread running right now.

Not all iterators implement EarlyPhases, because not all of them have anything special to do at this point in row mode parallel plans. This is analogous to the sequence project not implementing the Open method because it has nothing to do at that time. The main iterators with EarlyPhases methods are:

CQScanConcatNew (concatenation).
CQScanMergeJoinNew (merge join).
CQScanSwitchNew (switch).
CQScanExchangeNew (parallelism).
CQScanNew (rowset access e.g. scans and seeks).
CQScanProfileNew (invisible profilers).
CQScanLightProfileNew (invisible lightweight profilers).

Branch B early phases

The parent task continues by calling EarlyPhases on child operators beyond the gather streams exchange at node 2. A task moving over a branch boundary might seem unusual, but remember execution context zero contains the whole serial plan, with exchanges included. Early phase processing is about initializing parallelism, so it doesn’t count as execution per se.

To help you keep track, the picture below shows the iterators in the Branch B of the plan:

Branch B with profilers

Remember, we are still in execution context zero, so I am only referring to this as Branch B for convenience. We have not started any parallel execution yet.

The sequence of early phase code invocations in Branch B is:

CQScanProfileNew::EarlyPhases for the profiler above node 3.
CQScanMergeJoinNew::EarlyPhases at the node 3 merge join.
CQScanProfileNew::EarlyPhases for the profiler above node 4. The node 4 stream aggregate itself does not have an early phases method.
CQScanProfileNew::EarlyPhases on the profiler above node 5.
CQScanExchangeNew::EarlyPhases for the repartition streams exchange at node 5.

Notice we are only processing the outer (upper) input to the merge join at this stage. This is just the normal row mode execution iterative sequence. It is not particular to parallel plans.

Branch C early phases

Early phase processing continues with the iterators in Branch C:

Branch C with profilers

The sequence of calls here is:

CQScanProfileNew::EarlyPhases for the profiler above node 6.
CQScanProfileNew::EarlyPhases for the profiler above node 7.
CQScanProfileNew::EarlyPhases on the profiler above node 9.
CQScanNew::EarlyPhases for the index seek at node 9.

There is no EarlyPhases method on the stream aggregate or sort. The work performed by compute scalar at node 8 is deferred (to the sort), so it does not appear in the query scan tree, and does not have an associated profiler.

About profiler timings

Parent task early phase processing began at the gather streams exchange at node 2. It descended the query scan tree, following the outer (upper) input to the merge join, all the way down to the index seek at node 9. Along the way, the parent task has called the EarlyPhases method on every iterator that supports it.

None of the early phases activity has so far updated any times in the profiling DMV. Specifically, none of the iterators touched by early phases processing have had their ‘open time’ set. This makes sense, because early phase processing is just setting up parallel execution—these operators will be opened for execution later on.

The index seek at node 9 is a leaf node—it has no children. The parent task now begins returning from the nested EarlyPhases calls, ascending the query scan tree back towards the gather streams exchange.

Each of the profilers calls the Query Performance Counter API on entry to their EarlyPhases method, and they call it again on the way out. The difference between the two numbers represents elapsed time for the iterator and all its children (since the method calls are nested).

After the profiler for the index seek returns, the profiler DMV shows elapsed and CPU time for the index seek only, as well as an updated last active time. Notice also that this information is recorded against the parent task (the only option right now):

Updated index seek timings

None of the prior iterators touched by the early phases calls have elapsed times, or updated last active times. These numbers are only updated when we ascend the tree.

After the next profiler early phases call return, the sort times are updated:

Updated sort timings

The next return takes us up past the profiler for the stream aggregate at node 6:

Updated stream aggregate timings

Returning from this profiler takes us back to the EarlyPhases call at the repartition streams exchange at node 5. Remember that this is not where the sequence of early phases calls started—that was the gather streams exchange at node 2.

Branch C Parallel Tasks Enqueued

Aside from updating profiling data, the prior early phases calls didn’t seem to do very much. That all changes with the repartition streams exchange at node 5.

I am going to describe Branch C in a fair amount of detail to introduce a number of important concepts, which will apply to the other parallel branches as well. Covering this ground once now means later branch discussion can be more succinct.

Having completed nested early phase processing for its subtree (down to the index seek at node 9), the exchange can begin its own early phase work. This starts out the same as opening the gather streams exchange at node 2:

CXTransLocal::Open (opening the local parallel sub-transaction).
CXPort::Register (registering with the exchange port).

The next steps are different because branch C contains a fully blocking iterator (the sort at node 7). The early phase processing at the node 5 repartition streams does the following:

Calls CQScanExchangeNew::StartAllProducers. This is the first time we have encountered anything referencing the producer side of the exchange. Node 5 is the first exchange in this plan to create its producer side.
Acquires a mutex so no other thread can be enqueueing tasks at the same time.
Starts parallel nested transactions for the producer tasks (CXPort::StartNestedTransactions and ReadOnlyXactImp::BeginParallelNestedXact).
Registers the sub-transactions with the parent query scan object (CQueryScan::AddSubXact).
Creates producer descriptors (CQScanExchangeNew::PxproddescCreate).
Creates new producer execution contexts (CExecContext) derived from execution context zero.
Updates the linked map of plan iterators.
Sets DOP for the new context (CQueryExecContext::SetDop) so all tasks know what the overall DOP setting is.
Initializes the parameter cache (CQueryExecContext::InitParamCache).
Links the parallel nested transactions to the base transaction (CExecContext::SetBaseXact).
Queues the new sub-processes for execution (SubprocessMgr::EnqueueMultipleSubprocesses).
Creates new parallel tasks tasks via sqldk!SOS_Node::EnqueueMultipleTasksDirect.

The parent task’s call stack (for those of you that enjoy these things) around this point in time is:

Parent task call stack

End of Part 3

We have now created the producer side of the repartition streams exchange at node 5, created additional parallel tasks to run Branch C, and linked everything back to parent structures as required. Branch C is the first branch to start any parallel tasks. The final part of this series will look at branch C opening in detail, and start the remaining parallel tasks.