
SQL University: Parallelism Week - Part 3, Settings and Options


Congratulations! You've made it back for the third and final installment of Parallelism Week here at SQL University. So far we've covered the fundamentals of multitasking vs. parallel processing and delved into how parallel query plans actually work. Today we'll take a look at the settings and options that influence intra-query parallelism and discuss how best to set things up in various situations.

 

Instance-Level Configuration

Your database server probably has more than one logical processor. As a matter of fact, as of the time of this writing--May, 2010--I would be shocked to see any new piece of hardware requisitioned for use as a "database server" that contains fewer than eight cores. Processing power is relatively plentiful and thanks to SQL Server's socket-based licensing scheme and the prevalence of multi-core CPUs, it's almost ridiculously cheap to put together a server with a huge number of logical processors.

So there you are with your oh-so-powerful server. You've done your disk configuration, set up Windows appropriately, and installed SQL Server. You're almost done! But before you mark this project as finished, it's important to consider a few instance-level settings that can affect parallelism:

  • Affinity Mask controls which logical processors are used by the SQL Server instance. This may not be considered a parallelism setting per se, but it's an important part of the story. Setting this option to too low a number obviously restricts what SQL Server can do in parallel. But it's important to consider this option, especially for servers running multiple SQL Server instances.
  • Maximum Degree of Parallelism dictates, at an instance level, the maximum number of logical processors that will be used by a given query. More on this in the next section.
  • Cost Threshold for Parallelism is the primary threshold used by the query optimizer in determining whether it should consider parallel plans as part of the optimization process. More on this in the How Much Does it Cost section.

I won't discuss Affinity Mask any further at this point, except to mention that the sp_configure "affinity mask" setting is listed in Books Online as a deprecated feature. It has been replaced by the ALTER SERVER CONFIGURATION SET PROCESS AFFINITY statement. As far as I can tell, the end result is the same.
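
For reference, here's a minimal sketch of the newer syntax; the CPU range is purely illustrative, so substitute whatever makes sense for your hardware:

-- Bind the instance to logical processors 0 through 3 (illustrative range only)
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 3;

-- Or let SQL Server manage processor affinity automatically (the default)
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;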

 

98 Degrees

(Because how often, really, do you get to name a section of a technical article after what is described in Wikipedia as an "adult contemporary boy band?" And doesn't "boy band" eliminate the possibility of said band making "adult" music? Or the other way around? But I digress...)

Maximum Degree of Parallelism (MAXDOP) seems simple enough. You set it at the instance level and it controls the maximum number of logical processors that a given query can use, assuming that the query hasn't overridden the setting. But here things start getting trickier. First there's the question of what the "correct" setting actually is. And then there's the monitoring question, and the confusion around tasks vs. processor utilization. I'll clarify both of these as best I can.

How do you set the thing? The default value is "0," which means "use up to as many logical processors as have been assigned to the SQL Server instance by the Affinity Mask option." This is, in my opinion, one of a few default settings that the SQL Server team really got wrong. SQL Server instances are usually used by a lot of people at once, and one of our goals as DBAs or DB developers is to maximize concurrency. This means getting all of the queries that are running at the same time to finish as quickly as possible, and that means being able to distribute the available resources in a relatively even manner.

Processor time is one of the core resources that every query needs to use, and if a decent number of queries are each using every core simultaneously, processor time is going to quickly become the most limited resource on the system. In Part 1 of this week's series I mentioned "thrashing," and if you have 10 queries active on your system, each of which is using all 16 cores, thrashing is exactly what you'll have. Each query's tasks will spend as much--or more--time switching on and off of schedulers, and waiting to get back onto schedulers, as actually doing work. And that's clearly what you don't want to have happen in a highly concurrent environment.

A lot of articles recommend an instance-wide MAXDOP setting of 1/2 of the number of logical processors allocated to the instance. I think that's a good starting point, especially for instances supporting data warehouses and mixed workloads, but it's certainly not the final answer. Some articles I've read recommend never going above a value of "8," and I think that's too Draconian a prescription. And a lot of people running OLTP workloads like to set the value to "1." Which can also be a bit severe.

The fact is, too much parallelism will hurt concurrency, and too little will hurt the performance of big queries. So how do you figure out what setting to use? There's really only one surefire way: baselining, preferably in conjunction with load testing. Run a load test--or look at your production system's metrics if you have no test environment--and establish a CPU and query performance baseline. Change the MAXDOP setting and try again. Rinse and repeat until you find something optimal.
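
For reference, here's a minimal sketch of changing the instance-wide value between test runs; the value of 8 is just an example, and your baseline numbers should drive the final choice:

-- 'max degree of parallelism' is an advanced option, so expose it first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cap each query at 8 logical processors (example value only)
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;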

For OLTP workloads, a setting of "1" is, indeed, quite often a good idea for most queries. Unfortunately, this value is problematic because the MAXDOP setting also controls non-query operations such as index rebuilds. You usually want your index rebuilds to process in parallel whenever possible because they'll finish more quickly that way, thereby freeing resources for your users.

Index operations and individual queries can override the instance-level MAXDOP setting by using the MAXDOP query hint. (A very well-named hint!) So on an instance with the sp_configure "max degree of parallelism" setting set to "1," I could tell the query optimizer to use up to 10 logical processors for the following query by using the hint:

SELECT *
FROM Tbl
WHERE
    Col > 5
OPTION (MAXDOP 10)
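
Index operations get the same override in the form of a rebuild option rather than a query hint. As a sketch, with hypothetical index and table names:

-- Allow this rebuild to use up to 8 logical processors, regardless of the instance setting
ALTER INDEX IX_Tbl_Col ON Tbl REBUILD WITH (MAXDOP = 8);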

Since you can override the setting on an individual query basis, the instance-wide value should be set in such a way as to positively influence the great majority of queries running on the instance. But how do you find those that need special attention via hints? My friend Jimmy May pointed out the solution to me in a recent conversation: run a load test with a non-zero instance-level MAXDOP setting, baseline performance metrics for each query, then flip the value to "1," test again, and compare the results. Or the other way around. The queries which perform better using the setting you don't want to set at the instance level are the ones that need a hint. Simple enough!

"I have MAXDOP set to '4.' Why is my query spinning up 13 tasks?" As you become more familiar with how to monitor SQL Server you will invariably start looking at the sys.dm_os_tasks DMV and will quickly notice that MAXDOP does not seem to directly impact the number of tasks that a given request can use. This can be a bit confusing, since you might expect that a request using more tasks equates to a request using more logical processors. But that's not exactly how it works. In truth, a query's degree of parallelism determines how many tasks--at most--each individual iterator can spin up. And several iterators can run concurrently at any given time during a request's life cycle. The setting also determines the maximum number of schedulers all of those tasks can--by virtue of their associated workers--be bound to.

A query with a degree of parallelism of "4" that is using 13 tasks most likely has three active iterators, each of which is using four tasks. So why 13 and not 12? Because there is always a task dedicated to coordinating the other tasks. In sys.dm_os_tasks and other task-related DMVs it will be the task with an exec_context_id (execution context identifier) of "0". The others will have a non-zero identifier.
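
You can see this breakdown for yourself with a quick query against the DMV. A sketch only; the session_id is hypothetical, so plug in one from sys.dm_exec_requests:

-- List the tasks for a single request; the coordinator is the row with exec_context_id = 0
SELECT
    session_id,
    request_id,
    exec_context_id,
    scheduler_id
FROM sys.dm_os_tasks
WHERE session_id = 53   -- hypothetical session
ORDER BY exec_context_id;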

By the way, just because you ask for a given degree of parallelism doesn't mean you'll get it. The query processor may use a DOP lower than the one you requested. To find out the actual degree of parallelism for your query, take a look at the DegreeOfParallelism attribute of the QueryPlan node in the showplan XML.
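
One low-effort way to get at that XML is to capture the actual plan along with the query. A sketch, reusing the hypothetical query from earlier:

-- Return the actual execution plan as showplan XML alongside the results;
-- inspect the DegreeOfParallelism attribute on the QueryPlan node
SET STATISTICS XML ON;

SELECT *
FROM Tbl
WHERE
    Col > 5
OPTION (MAXDOP 10);

SET STATISTICS XML OFF;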

 

How Much Does it Cost?

SQL Server uses cost-based optimization to determine how best to process your queries. This means that the optimizer considers a number of potential query plans and assigns each a cost, or a score that acts as a relative means by which to compare different plans. This number is calculated by assigning a fixed price to each operation that is part of the plan and scaling it by the estimated number of times that operation is expected to take place in order to satisfy your query. This same cost-based model is used to figure out whether to process your query serially or in parallel.

Query optimization is an expensive process, so it happens in phases. Each phase takes a bit longer than the previous one, and during the later phases the query optimizer considers more complex plans that are more difficult to cost. Parallel plans are certainly more complicated than serial plans, and so are reserved for the later phases of the optimization process. And this is where the Cost Threshold for Parallelism setting comes into play.

The parallel plan selection process can be described fairly simply: During the early parts of the optimization process, the optimizer considers serial plans. Should these plans exceed the set cost threshold, parallel plans may be considered at later phases of the process. If one of the parallel plans is cheaper than the serial version, it's used.

Modifying this setting can get a bit confusing, since it's difficult to see from parallel plans what impact changing the setting will have. If your parallel plan has an estimated cost of "6.58," setting the Cost Threshold to "7" may or may not cause the plan to be processed serially. What matters is not the parallel cost, but rather the cost of the alternative, serial version of the plan. And to discover that you must either change the instance-level MAXDOP setting, or use the MAXDOP query hint, to see how the estimated costs change.
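
One quick way to see the serial cost, as a sketch reusing the hypothetical query from earlier: compile it with a MAXDOP 1 hint and check the estimated subtree cost of the resulting serial plan against the threshold value you're considering.

-- Force a serial plan so the estimated cost shown is the serial cost,
-- which is the number compared against Cost Threshold for Parallelism
SELECT *
FROM Tbl
WHERE
    Col > 5
OPTION (MAXDOP 1)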

The default value of this setting is "5," and tweaking it is done in a similar fashion to what I described for the instance-level MAXDOP setting--with the exception that Cost Threshold for Parallelism cannot be overridden on a per-query basis. For this reason it's even more important that you get it right, and many DBAs don't touch it at all. One interesting way to use it is to set it to a higher-than-default value (some articles suggest using a value of "20" or higher) as an alternative to an instance-wide MAXDOP setting of "1." This guarantees that only queries estimated to be significantly expensive will be parallelized, and index rebuilds and other non-query processes will still be able to make use of parallelism without having to be overridden.
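
As with MAXDOP, the change is made through sp_configure. A sketch using the higher-than-default value mentioned above:

-- 'cost threshold for parallelism' is an advanced option, so expose it first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Only queries whose serial plans are estimated above a cost of 20 become candidates for parallelism
EXEC sp_configure 'cost threshold for parallelism', 20;
RECONFIGURE;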

 

Parallelism Inhibitors

To finish things up I would like to point out one of the most frustrating parts of the parallelism game. You have your server configured the way you think it should be, you've written your massive query, and you're ready to scale it across multiple CPUs. But alas, the optimizer just won't generate a parallel plan. What's going on?

Assuming that the query's estimated cost is higher than the Cost Threshold and that the MAXDOP setting isn't "1," you're most likely encountering one of the various features that inhibit parallelism. The list below is taken from a presentation written by Craig Freedman, a member of the SQL Server query processor team.

The following features force the entire plan to be processed serially:

  • All T-SQL UDFs* (see the example after the lists)
  • CLR UDFs with data access
  • Miscellaneous built-ins such as:
    • OBJECT_ID(), ERROR_NUMBER(), @@TRANCOUNT, …
  • Dynamic cursors

The following features force a "serial zone", or a part of the plan to be processed serially even though other parts can be parallelized:

  • System table scans
  • Sequence functions
  • TOP
  • "Backward" scans
  • Recursive queries
  • TVFs*
  • Global scalar aggregate
  • Multi-consumer spool

* By "UDFs," Craig meant the scalar variety. And by "TVFs," he meant the multi-statement flavor. Inline TVFs do not have this issue.
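
To make the first bullet concrete, here's a hedged sketch with hypothetical object names: wrapping even a trivial calculation in a scalar T-SQL UDF forces the whole plan to run serially, while the same expression written inline leaves the optimizer free to consider a parallel plan.

-- A trivial scalar UDF (hypothetical); referencing it forces a serial plan
CREATE FUNCTION dbo.AddTax (@amount MONEY)
RETURNS MONEY
AS
BEGIN
    RETURN @amount * 1.08;
END;
GO

-- Serial: the scalar UDF inhibits parallelism for the entire plan
SELECT OrderID, dbo.AddTax(TotalDue) AS TotalWithTax
FROM dbo.BigOrders;

-- Eligible for a parallel plan: the same logic expressed inline
SELECT OrderID, TotalDue * 1.08 AS TotalWithTax
FROM dbo.BigOrders;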

If you're having problems getting your query to go parallel, take a step back. Double-check the instance-wide settings. Double-check your query. Are you doing anything on the list above? If not, break your query into smaller component parts. Can you get them to go parallel? Are other queries operating in parallel as expected?

This can be both tricky and frustrating, but be patient--you'll get there. And your patience will be rewarded. When properly leveraged, parallelism makes a huge difference, allowing some queries to perform several times better than they do when processed serially.

 

An End, and A Beginning

When I was approached by Jorge Segarra and asked to write a week's worth of posts, I chose a topic that I happened to be studying at the time and thought that it would be pretty easy to dash off a few articles. After investing over 12 hours in writing these three posts, I have to admit that I only covered around half of the topics I was hoping to, and covered some of them at a much higher level than I originally intended.

So while this is the end of my SQL University Parallelism Week contribution, you can expect several more posts on the topic of parallelism from me in the coming months. I have some very interesting things to share, and I hope you'll keep reading. This topic, at least for me, gets more and more interesting the more I learn about it, and we've only scratched the surface.

 

Acknowledgments

I would like to thank the following people, each of whom helped me in some way, either directly or indirectly, as I researched and wrote this series of posts:

Jimmy May, Paul White, Craig Freedman, Conor Cunningham, Mladen Prajdic

 

And thank you for reading. Enjoy, and as usual please let me know what points I can clarify.

