
SQL Server 2019 APPROX_COUNT_DISTINCT Function

By: Aaron Bertrand | Last Updated: 2018-12-19 | Related Tips: More > SQL Server 2019

Problem

In a recent tip, I provided some details about a whole slew of new features that were announced for the first CTP of SQL Server 2019. There's definitely more to come, but today I wanted to dig a little deeper into a specific feature, the new aggregate function APPROX_COUNT_DISTINCT, and show how I tested it out in my home lab. In this video, Joe Sack explains what differences you should expect to see in the execution plan, but I was curious what I would observe at scale.
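For context, the new function is a drop-in replacement for the exact aggregate; as a minimal sketch (the table and column names here are placeholders, not part of the test below), the two forms look like this:

-- exact: reads every value to count distinct CustomerID values
SELECT COUNT(DISTINCT CustomerID) FROM Sales.Orders;

-- approximate: returns an estimate of the distinct count
SELECT APPROX_COUNT_DISTINCT(CustomerID) FROM Sales.Orders;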

Solution

My goal was to test the impact on both query duration and memory footprint for queries using the new function, which trades accuracy for speed by producing a rough estimate instead of an exact count, compared to the more accurate COUNT(DISTINCT()) approach, which forces a read of every single value. I wanted to see how each approach would fare both with batch mode over rowstore (another new enhancement) and without it. I also wanted to compare integers against strings of varying widths and varying fullness (since memory grants for variable-width columns are based on the assumption that they will, on average, be half full). And finally, I wanted to judge the accuracy of the approximation at various ratios of distinct values. So, for a table with 10,000,000 rows, I wanted columns with 100 distinct values, 1,000 distinct values, and so on, up to the subset of columns where all 10,000,000 values would be unique. Building a table to support this was a job in and of itself.
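To isolate the batch mode on rowstore variable, each query can be run twice: once as-is, and once with the same query-level hint the test harness below uses to opt out of batch mode. A minimal sketch, using one of the integer columns from the table built below:

-- eligible for batch mode on rowstore (database compatibility level 150)
SELECT COUNT(DISTINCT i1) FROM dbo.FactTable;

-- same query, forced back to row mode for an apples-to-apples comparison
SELECT COUNT(DISTINCT i1) FROM dbo.FactTable
OPTION (USE HINT('DISALLOW_BATCH_MODE'));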

Here is the script I came up with:

DROP TABLE IF EXISTS dbo.FactTable;
GO

SET NOCOUNT ON;

DECLARE @x8   char(8)   = REPLICATE('@',8),
        @x16  char(16)  = REPLICATE('!',16),
        @x32  char(32)  = REPLICATE('$',32),
        @x64  char(64)  = REPLICATE('&',64),
        @x128 char(128) = REPLICATE('^',128),
        @x256 char(256) = REPLICATE('#',256);

;WITH src(n) AS
(
  SELECT 1 UNION ALL SELECT n + 1 FROM src WHERE n < 100
)
SELECT
  -- integers (rn is unique; i1 .. i5 repeat values to give progressively fewer distinct values)
  rn, i1, i2, i3, i4, i5,
  -- varchar, minimally populated
  vm0 = CONVERT(varchar(8), rn),
  vm1 = CONVERT(varchar(8), i1),
  vm2 = CONVERT(varchar(8), i2),
  vm3 = CONVERT(varchar(8), i3),
  vm4 = CONVERT(varchar(8), i4),
  vm5 = CONVERT(varchar(8), i5),
  -- varchar, half populated
  vh1 = CONVERT(varchar(16),  RIGHT(@x8   + RTRIM(i1),   8)),
  vh2 = CONVERT(varchar(32),  RIGHT(@x16  + RTRIM(i2),  16)),
  vh3 = CONVERT(varchar(64),  RIGHT(@x32  + RTRIM(i3),  32)),
  vh4 = CONVERT(varchar(128), RIGHT(@x64  + RTRIM(i4),  64)),
  vh5 = CONVERT(varchar(256), RIGHT(@x128 + RTRIM(i5), 128)),
  -- varchar, fully populated
  vf0 = CONVERT(varchar(8),   RIGHT(@x8   + RTRIM(rn),   8)),
  vf1 = CONVERT(varchar(16),  RIGHT(@x16  + RTRIM(i1),  16)),
  vf2 = CONVERT(varchar(32),  RIGHT(@x32  + RTRIM(i2),  32)),
  vf3 = CONVERT(varchar(64),  RIGHT(@x64  + RTRIM(i3),  64)),
  vf4 = CONVERT(varchar(128), RIGHT(@x128 + RTRIM(i4), 128)),
  vf5 = CONVERT(varchar(256), RIGHT(@x256 + RTRIM(i5), 256))
INTO dbo.FactTable
FROM
(
  SELECT rn, i1 = rn / 10, i2 = rn / 100, i3 = rn / 1000, i4 = rn / 10000, i5 = n
  FROM
  (
    -- 100 x 100 x 100 x 10 = 10,000,000 rows; rn is 0-based so the divisions yield
    -- 1,000,000 / 100,000 / 10,000 / 1,000 distinct values for i1 .. i4, and i5 has 100
    SELECT s.n, rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
    FROM src AS s
    CROSS JOIN src AS s2
    CROSS JOIN src AS s3
    CROSS JOIN (SELECT TOP (10) n FROM src) AS s4
  ) AS x
) AS y;
GO

CREATE UNIQUE CLUSTERED INDEX cix ON dbo.FactTable(rn);

Have patience, because this took over two minutes on my machine, but it produced a table with 10,000,000 rows that we could now measure queries against. A few sample rows (well, what I could fit on screen, anyway):

SELECT TOP (10) * FROM dbo.FactTable ORDER BY NEWID();
[Screenshot: a sample of rows from dbo.FactTable]
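Before timing anything, it can be worth a quick sanity check that the columns came out with the intended numbers of distinct values. This query is not part of the original harness, just a verification sketch (the expected counts assume the 0-based rn in the build script above):

SELECT
  rn = COUNT(DISTINCT rn),   -- expect 10,000,000
  i1 = COUNT(DISTINCT i1),   -- expect  1,000,000
  i2 = COUNT(DISTINCT i2),   -- expect    100,000
  i3 = COUNT(DISTINCT i3),   -- expect     10,000
  i4 = COUNT(DISTINCT i4),   -- expect      1,000
  i5 = COUNT(DISTINCT i5)    -- expect        100
FROM dbo.FactTable;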

I developed a quick script that loops through each of the columns in the table and inserts the exact and approximate counts into a temp table (the timings are captured separately, as we'll see in a moment):

DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
GO

-- scan the table once so the timed queries below read from the buffer pool
DECLARE @rn int;
SELECT @rn = rn FROM dbo.FactTable;
GO

DROP TABLE IF EXISTS #s;
GO

CREATE TABLE #s(col sysname, cd_rows int, approx_rows int);
GO

SET NOCOUNT ON;
GO

-- keep only the counts from the most recent run of this batch
DELETE #s;

DECLARE @sql nvarchar(max) = N'', @base nvarchar(max) = N'
EXEC sys.sp_executesql N''INSERT #s(col, cd_rows, approx_rows) SELECT ''''$n'''', COUNT(DISTINCT($n)), NULL FROM dbo.FactTable OPTION (USE HINT(''''DISALLOW_BATCH_MODE''''));'';
EXEC sys.sp_executesql N''INSERT #s(col, cd_rows, approx_rows) SELECT ''''$n'''', COUNT(DISTINCT($n)), NULL FROM dbo.FactTable;'';
EXEC sys.sp_executesql N''INSERT #s(col, cd_rows, approx_rows) SELECT ''''$n'''', NULL, APPROX_COUNT_DISTINCT($n) FROM dbo.FactTable OPTION (USE HINT(''''DISALLOW_BATCH_MODE''''));'';
EXEC sys.sp_executesql N''INSERT #s(col, cd_rows, approx_rows) SELECT ''''$n'''', NULL, APPROX_COUNT_DISTINCT($n) FROM dbo.FactTable;'';';

-- build one copy of the four statements for every column except the clustering key
SELECT @sql += REPLACE(@base, N'$n', name)
  FROM sys.columns
  WHERE object_id = OBJECT_ID(N'dbo.FactTable')
  AND name <> N'rn';

EXEC sys.sp_executesql @sql;
-- run this entire batch 5 times
GO 5
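To make the dynamic SQL easier to read, here is what a single pass of @base expands to after the REPLACE, using i5 purely as an example; every column except rn gets the same four statements (exact and approximate, each with and without batch mode):

EXEC sys.sp_executesql N'INSERT #s(col, cd_rows, approx_rows) SELECT ''i5'', COUNT(DISTINCT(i5)), NULL FROM dbo.FactTable OPTION (USE HINT(''DISALLOW_BATCH_MODE''));';
EXEC sys.sp_executesql N'INSERT #s(col, cd_rows, approx_rows) SELECT ''i5'', COUNT(DISTINCT(i5)), NULL FROM dbo.FactTable;';
EXEC sys.sp_executesql N'INSERT #s(col, cd_rows, approx_rows) SELECT ''i5'', NULL, APPROX_COUNT_DISTINCT(i5) FROM dbo.FactTable OPTION (USE HINT(''DISALLOW_BATCH_MODE''));';
EXEC sys.sp_executesql N'INSERT #s(col, cd_rows, approx_rows) SELECT ''i5'', NULL, APPROX_COUNT_DISTINCT(i5) FROM dbo.FactTable;';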

Again, have patience here, since this takes somewhat longer than the initial population (about 5 minutes per batch).

Once we have the most recent measures in the temp table, we can look at those alongside the timings captured for us automatically in the DMV sys.dm_exec_query_stats:

;WITH x AS
(
  SELECT s.col,
    [# of unique values]    = MAX(s.cd_rows),
    [new guess]             = MAX(s.approx_rows),
    [# off]                 = MAX(s.approx_rows) - MAX(s.cd_rows),
    [% off]                 = CASE WHEN MAX(s.approx_rows) < MAX(s.cd_rows) THEN -1 ELSE 1 END
                              * CONVERT(decimal(6,3), MAX(s.approx_rows * 1.0) / MAX(s.cd_rows * 1.0)),
    avg_time_actual         = CONVERT(decimal(20,2), MAX(stat.avg_time_actual)),
    avg_time_guess          = CONVERT(decimal(20,2), MAX(stat.avg_time_guess)),
    avg_time_actual_batch   = CONVERT(decimal(20,2), MAX(stat.avg_time_actual_batch)),
    avg_time_guess_batch    = CONVERT(decimal(20,2), MAX(stat.avg_time_guess_batch)),
    avg_memory_actual       = CONVERT(decimal(20,1), MAX(stat.avg_memory_actual)),
    avg_memory_guess        = CONVERT(decimal(20,1), MAX(stat.avg_memory_guess)),
    avg_memory_actual_batch = CONVERT(decimal(20,1), MAX(stat.avg_memory_actual_batch)),
    avg_memory_guess_batch  = CONVERT(decimal(20,1), MAX(stat.avg_memory_guess_batch))
  FROM #s AS s
  LEFT OUTER JOIN
  (
    SELECT text,
      avg_time_actual         = CASE t WHEN 'actual'       THEN avg_time   END,
      avg_memory_actual       = CASE t WHEN 'actual'       THEN avg_memory END,
      avg_time_guess          = CASE t WHEN 'guess'        THEN avg_time   END,
      avg_memory_guess        = CASE t WHEN 'guess'        THEN avg_memory END,
      avg_time_actual_batch   = CASE t WHEN 'actual_batch' THEN avg_time   END,
      avg_memory_actual_batch = CASE t WHEN 'actual_batch' THEN avg_memory END,
      avg_time_guess_batch    = CASE t WHEN 'guess_batch'  THEN avg_time   END,
      avg_memory_guess_batch  = CASE t WHEN 'guess_batch'  THEN avg_memory END
    FROM
    (
      SELECT t.text,
        t = CASE
              WHEN t.text LIKE N'%COUNT(DISTINCT(%DISALLOW%'            THEN 'actual'
              WHEN t.text LIKE N'%COUNT(DISTINCT(%'                     THEN 'actual_batch'
              WHEN t.text LIKE N'%APPROX_COUNT%dbo.FactTable%DISALLOW%' THEN 'guess'
              -- NOTE: the listing was truncated at this point in the source; everything
              -- from here to the end of the statement is a plausible reconstruction
              WHEN t.text LIKE N'%APPROX_COUNT%dbo.FactTable%'          THEN 'guess_batch'
            END,
        avg_time   = qs.total_elapsed_time * 1.0 / qs.execution_count / 1000, -- microseconds -> ms
        avg_memory = qs.total_grant_kb * 1.0 / qs.execution_count
      FROM sys.dm_exec_query_stats AS qs
      CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
      WHERE t.text LIKE N'INSERT #s%' -- only the statements generated by the test harness
    ) AS y
  ) AS stat
  ON stat.text LIKE N'%''' + s.col + N'''%'
  GROUP BY s.col
)
SELECT * FROM x ORDER BY [# of unique values];
