This article looks into selectivity and cardinality estimation for predicates on COUNT(*) expressions, as may be seen in HAVING clauses. The details are hopefully interesting in themselves. They also provide an insight into some of the general approaches and algorithms used by the cardinality estimator.
A simple example using the AdventureWorks sample database:
SELECT A.City
FROM Person.[Address] AS A
GROUP BY A.City
HAVING COUNT_BIG(*) = 1;

We are interested in seeing how SQL Server derives an estimate for the predicate on the count expression in the HAVING clause.
Of course the HAVING clause is just syntactic sugar. We could equally have written the query using a derived table or a common table expression:
-- Derived table
SELECT SQ1.City
FROM
(
    SELECT A.City, Expr1001 = COUNT_BIG(*)
    FROM Person.[Address] AS A
    GROUP BY A.City
) AS SQ1
WHERE SQ1.Expr1001 = 1;

-- CTE
WITH Grouped AS
(
    SELECT A.City, Expr1001 = COUNT_BIG(*)
    FROM Person.[Address] AS A
    GROUP BY A.City
)
SELECT G.City
FROM Grouped AS G
WHERE G.Expr1001 = 1;

All three query forms produce the same execution plan, with identical query plan hash values.
The post-execution (actual) plan shows a perfectly accurate estimate for the aggregate; however, the estimate for the HAVING clause filter (or its equivalent, in the other query forms) is poor:
[Figure: post-execution (actual) execution plan showing the estimates]
Statistics on the City column provide accurate information about the number of distinct city values:
DBCC SHOW_STATISTICS ([Person.Address], City) WITH DENSITY_VECTOR;

[Figure: DBCC SHOW_STATISTICS density vector output]
The all density figure is the reciprocal of the number of unique values. Simply calculating (1 / 0.00173913) = 575 gives the cardinality estimate for the aggregate. Grouping by city obviously produces one row for each distinct value.
Note that all density comes from the density vector. Be careful not to accidentally use the density value from the statistics header output of DBCC SHOW_STATISTICS. The header density is maintained only for backward compatibility; it is not used by the optimizer during cardinality estimation these days.
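As a quick sanity check (not something the optimizer does, just confirmation of the arithmetic), we can count the distinct city values directly and compare the result with the reciprocal of the density figure:

-- Sanity check: 575 distinct cities, matching 1 / 0.00173913
SELECT
    DistinctCities = COUNT_BIG(DISTINCT A.City),
    FromDensity    = 1e0 / 0.00173913
FROM Person.[Address] AS A;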
The Problem

The aggregate introduces a new computed column to the workflow, labelled Expr1001 in the execution plan. It contains the value of COUNT(*) in each grouped output row:
[Figure: execution plan properties showing the Expr1001 computed column]
There is obviously no statistical information in the database about this new computed column. While the optimizer knows there will be 575 rows, it knows nothing about the distribution of count values within those rows.
Well, not quite nothing: the optimizer is aware that the count values will be positive integers (1, 2, 3…). Still, it is the distribution of these integer count values among the 575 rows that would be needed to accurately estimate the selectivity of the COUNT(*) = 1 predicate.
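Purely for our own information (the optimizer has no way to run this, of course), the following query shows how the COUNT(*) values are actually distributed across the 575 groups:

-- For illustration only: how many groups produce each distinct COUNT(*) value
SELECT
    G.NumRows,
    NumGroups = COUNT_BIG(*)
FROM
(
    SELECT A.City, NumRows = COUNT_BIG(*)
    FROM Person.[Address] AS A
    GROUP BY A.City
) AS G
GROUP BY G.NumRows
ORDER BY G.NumRows;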
One might think that some sort of distribution information could be derived from the histogram, but the histogram only provides specific count information (in EQ_ROWS) for histogram step values. Between histogram steps, all we have is a summary: RANGE_ROWS rows have DISTINCT_RANGE_ROWS distinct values. For tables that are large enough that we care about the quality of the selectivity estimate, it is very likely that most of the table is represented by these intra-step summaries.
For example, the first two rows of the City column histogram are:
DBCC SHOW_STATISTICS ([Person.Address], City) WITH HISTOGRAM;

[Figure: first two rows of the City column histogram]
This tells us that there is exactly one row for "Abingdon", and 29 other rows after "Abingdon" but before "Ballard", with 19 distinct values in that 29-row range. The following query shows the actual distribution of rows among unique values in that 29-row intra-step range:
SELECT A.City, NumRows = COUNT_BIG(*)
FROM Person.[Address] AS A
WHERE A.City > N'Abingdon'
AND A.City < N'Ballard'
GROUP BY ROLLUP (A.City);

[Figure: query results showing row counts per distinct city in the intra-step range]
There are 29 rows with 19 distinct values, just as the histogram said. Still, it is clear we have no basis to evaluate the selectivity of a predicate on the count column in that query. For example, HAVING COUNT_BIG(*) = 2 would return 5 rows (for Alexandria, Altadena, Atlanta, Augsburg, and Austin) but we have no way to determine that from the histogram.
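To see those five groups directly (again, just for illustration; the optimizer only has the step summary to work with), we can add the predicate to the intra-step query:

-- Returns the 5 cities in the intra-step range with exactly two rows
SELECT A.City, NumRows = COUNT_BIG(*)
FROM Person.[Address] AS A
WHERE A.City > N'Abingdon'
AND A.City < N'Ballard'
GROUP BY A.City
HAVING COUNT_BIG(*) = 2;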
An Educated Guess

The approach SQL Server takes is to assume that each group is most likely to contain the overall mean (average) number of rows. This is simply the cardinality divided by the number of unique values. For example, for 1000 rows with 20 unique values, SQL Server would assume that (1000 / 20) = 50 rows per group is the most likely value.
Turning back to our original example, this means that the computed count column is "most likely" to contain a value around (19614 / 575) ~= 34.1113 . Since density is the reciprocal of the number of unique values, we can also express that as cardinality * density = (19614 * 0.00173913), giving a very similar result.
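We can verify that mean against the data if we like (SQL Server derives it purely from the cardinality and density information, of course):

-- Average number of rows per city group: 19614 / 575 ~= 34.1113
SELECT MeanRowsPerGroup = AVG(1e0 * G.NumRows)
FROM
(
    SELECT A.City, NumRows = COUNT_BIG(*)
    FROM Person.[Address] AS A
    GROUP BY A.City
) AS G;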
Distribution

Saying that the mean value is most likely only takes us so far. We also need to establish exactly how likely that is, and how the likelihood changes as we move away from the mean value. Assuming that all groups have exactly 34.1113 rows in our example would not be a very "educated" guess!
SQL Server handles this by assuming a normal distribution. This has the characteristic bell shape you may already be familiar with (image from the linked Wikipedia entry):
[Figure: normal distribution bell curves]
The exact shape of the normal distribution depends on two parameters: the mean (μ) and the standard deviation (σ). The mean determines the location of the peak. The standard deviation specifies how "flattened" the bell curve is. The flatter the curve, the lower the peak is, and the more the probability density is distributed over other values.
SQL Server can derive the mean from statistical information as already noted. The standard deviation of the computed count column values is unknown. SQL Server estimates it as the square root of the mean (with a slight adjustment detailed later on). In our example, this means the two parameters of the normal distribution are roughly 34.1113 and 5.84 (the square root).
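As a simple arithmetic check on those two numbers (ignoring the slight adjustment mentioned above):

-- Assumed normal distribution parameters, before any adjustment
SELECT
    Mean  = 19614e0 / 575,        -- ~34.1113
    StDev = SQRT(19614e0 / 575);  -- ~5.84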
The standard normal distribution (the red curve in the diagram above) is a noteworthy special case. This occurs when the mean is zero, and the standard deviation is 1. Any normal distribution can be transformed to the standard normal distribution by subtracting the mean and dividing by the standard deviation.
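As a purely arithmetic illustration (not the optimizer's exact calculation, which we will come to), standardizing an arbitrary count value x just means computing z = (x - mean) / standard deviation:

-- Standardizing an example count value of 30 against our estimated parameters
DECLARE @mean  float = 19614e0 / 575;        -- ~34.1113
DECLARE @stdev float = SQRT(19614e0 / 575);  -- ~5.84
SELECT z = (30e0 - @mean) / @stdev;          -- ~ -0.70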
Areas and Intervals

We are interested in estimating selectivity, so we are looking for the