This article looks into selectivity and cardinality estimation for predicates on COUNT(*) expressions, as may be seen in HAVING clauses. The details are hopefully interesting in themselves. They also provide an insight into some of the general approaches and algorithms used by the cardinality estimator.
A simple example using the AdventureWorks sample database:
SELECT A.City
FROM Person.[Address] AS A
GROUP BY A.City
HAVING COUNT_BIG(*) = 1;

We are interested in seeing how SQL Server derives an estimate for the predicate on the count expression in the HAVING clause.
Of course the HAVING clause is just syntactic sugar. We could equally have written the query using a derived table or a common table expression:
-- Derived table
SELECT SQ1.City
FROM
(
    SELECT A.City, Expr1001 = COUNT_BIG(*)
    FROM Person.[Address] AS A
    GROUP BY A.City
) AS SQ1
WHERE SQ1.Expr1001 = 1;

-- CTE
WITH Grouped AS
(
    SELECT A.City, Expr1001 = COUNT_BIG(*)
    FROM Person.[Address] AS A
    GROUP BY A.City
)
SELECT G.City
FROM Grouped AS G
WHERE G.Expr1001 = 1;

All three query forms produce the same execution plan, with identical query plan hash values.
The post-execution (actual) plan shows a perfectly accurate estimate for the aggregate; however, the estimate for the HAVING clause filter (or its equivalent, in the other query forms) is poor:
[Figure: post-execution (actual) execution plan showing the estimates]
Statistics on the City column provide accurate information about the number of distinct city values:
DBCC SHOW_STATISTICS ([Person.Address], City) WITH DENSITY_VECTOR;

[Figure: DBCC SHOW_STATISTICS density vector output]
The all density figure is the reciprocal of the number of unique values. Simply calculating (1 / 0.00173913) = 575 gives the cardinality estimate for the aggregate. Grouping by city obviously produces one row for each distinct value.
Note that all density comes from the density vector. Be careful not to accidentally use the density value from the statistics header output of DBCC SHOW_STATISTICS. The header density is maintained only for backward compatibility; it is not used by the optimizer during cardinality estimation these days.
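As a quick sanity check (not something the optimizer does, just confirmation of the arithmetic), we can count the distinct city values directly and compare the result with the reciprocal of the density figure:

-- Sanity check: 575 distinct cities, matching 1 / 0.00173913
SELECT
    DistinctCities = COUNT_BIG(DISTINCT A.City),
    FromDensity    = 1e0 / 0.00173913
FROM Person.[Address] AS A;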
The Problem

The aggregate introduces a new computed column to the workflow, labelled Expr1001 in the execution plan. It contains the value of COUNT(*) in each grouped output row:
[Figure: execution plan properties showing the Expr1001 computed column]
There is obviously no statistical information in the database about this new computed column. While the optimizer knows there will be 575 rows, it knows nothing about the distribution of count values within those rows.
Well, not quite nothing: the optimizer is aware that the count values will be positive integers (1, 2, 3…). Still, it is the distribution of these integer count values among the 575 rows that would be needed to accurately estimate the selectivity of the COUNT(*) = 1 predicate.
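Purely for our own information (the optimizer has no way to run this, of course), the following query shows how the COUNT(*) values are actually distributed across the 575 groups:

-- For illustration only: how many groups produce each distinct COUNT(*) value
SELECT
    G.NumRows,
    NumGroups = COUNT_BIG(*)
FROM
(
    SELECT A.City, NumRows = COUNT_BIG(*)
    FROM Person.[Address] AS A
    GROUP BY A.City
) AS G
GROUP BY G.NumRows
ORDER BY G.NumRows;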
One might think that some sort of distribution information could be derived from the histogram, but the histogram only provides specific count information (in EQ_ROWS) for histogram step values. Between histogram steps, all we have is a summary: RANGE_ROWS rows have DISTINCT_RANGE_ROWS distinct values. For tables that are large enough that we care about the quality of the selectivity estimate, it is very likely that most of the table is represented by these intra-step summaries.
For example, the first two rows of the City column histogram are:
DBCC SHOW_STATISTICS ([Person.Address], City) WITH HISTOGRAM;

[Figure: first two rows of the City column histogram]
This tells us that there is exactly one row for "Abingdon", and 29 other rows after "Abingdon" but before "Ballard", with 19 distinct values in that 29-row range. The following query shows the actual distribution of rows among unique values in that 29-row intra-step range:
SELECT A.City, NumRows = COUNT_BIG(*)
FROM Person.[Address] AS A
WHERE A.City > N'Abingdon'
AND A.City < N'Ballard'
GROUP BY ROLLUP (A.City);

[Figure: query results showing row counts per distinct city in the intra-step range]
There are 29 rows with 19 distinct values, just as the histogram said. Still, it is clear we have no basis to evaluate the selectivity of a predicate on the count column in that query. For example, HAVING COUNT_BIG(*) = 2 would return 5 rows (for Alexandria, Altadena, Atlanta, Augsburg, and Austin) but we have no way to determine that from the histogram.
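To see those five groups directly (again, just for illustration; the optimizer only has the step summary to work with), we can add the predicate to the intra-step query:

-- Returns the 5 cities in the intra-step range with exactly two rows
SELECT A.City, NumRows = COUNT_BIG(*)
FROM Person.[Address] AS A
WHERE A.City > N'Abingdon'
AND A.City < N'Ballard'
GROUP BY A.City
HAVING COUNT_BIG(*) = 2;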
An Educated Guess

The approach SQL Server takes is to assume that each group is most likely to contain the overall mean (average) number of rows. This is simply the cardinality divided by the number of unique values. For example, for 1000 rows with 20 unique values, SQL Server would assume that (1000 / 20) = 50 rows per group is the most likely value.
Turning back to our original example, this means that the computed count column is "most likely" to contain a value around (19614 / 575) ~= 34.1113 . Since density is the reciprocal of the number of unique values, we can also express that as cardinality * density = (19614 * 0.00173913), giving a very similar result.
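We can verify that mean against the data if we like (SQL Server derives it purely from the cardinality and density information, of course):

-- Average number of rows per city group: 19614 / 575 ~= 34.1113
SELECT MeanRowsPerGroup = AVG(1e0 * G.NumRows)
FROM
(
    SELECT A.City, NumRows = COUNT_BIG(*)
    FROM Person.[Address] AS A
    GROUP BY A.City
) AS G;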
Distribution

Saying that the mean value is most likely only takes us so far. We also need to establish exactly how likely that is, and how the likelihood changes as we move away from the mean value. Assuming that all groups have exactly 34.1113 rows in our example would not be a very "educated" guess!
SQL Server handles this by assuming a normal distribution. This has the characteristic bell shape you may already be familiar with (image from the linked Wikipedia entry):
[Figure: normal distribution bell curves]
The exact shape of the normal distribution depends on two parameters: the mean (μ) and the standard deviation (σ). The mean determines the location of the peak. The standard deviation specifies how "flattened" the bell curve is. The flatter the curve, the lower the peak is, and the more the probability density is distributed over other values.
SQL Server can derive the mean from statistical information as already noted. The standard deviation of the computed count column values is unknown. SQL Server estimates it as the square root of the mean (with a slight adjustment detailed later on). In our example, this means the two parameters of the normal distribution are roughly 34.1113 and 5.84 (the square root).
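As a simple arithmetic check on those two numbers (ignoring the slight adjustment mentioned above):

-- Assumed normal distribution parameters, before any adjustment
SELECT
    Mean  = 19614e0 / 575,        -- ~34.1113
    StDev = SQRT(19614e0 / 575);  -- ~5.84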
The standard normal distribution (the red curve in the diagram above) is a noteworthy special case. This occurs when the mean is zero, and the standard deviation is 1. Any normal distribution can be transformed to the standard normal distribution by subtracting the mean and dividing by the standard deviation.
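As a purely arithmetic illustration (not the optimizer's exact calculation, which we will come to), standardizing an arbitrary count value x just means computing z = (x - mean) / standard deviation:

-- Standardizing an example count value of 30 against our estimated parameters
DECLARE @mean  float = 19614e0 / 575;        -- ~34.1113
DECLARE @stdev float = SQRT(19614e0 / 575);  -- ~5.84
SELECT z = (30e0 - @mean) / @stdev;          -- ~ -0.70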
Areas and Intervals

We are interested in estimating selectivity, so we are looking for the