
SQL Server 2019 APPROX_COUNT_DISTINCT Function

By: Aaron Bertrand | Last Updated: 2018-12-19 | Related Tips: More > SQL Server 2019

Problem

In a recent tip, I provided some details about a whole slew of new features that were announced for the first CTP of SQL Server 2019. There's definitely more to come, but today I wanted to dig a little deeper into a specific feature, the new aggregate function APPROX_COUNT_DISTINCT, and show how I tested it out in my home lab. In this video, Joe Sack explains what differences you should expect to see in the execution plan, but I was curious what I would observe at scale.
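For context, the new function is a drop-in replacement for the exact aggregate; as a minimal sketch (the table and column names here are placeholders, not part of the test below), the two forms look like this:

-- exact: reads every value to count distinct CustomerID values
SELECT COUNT(DISTINCT CustomerID) FROM Sales.Orders;

-- approximate: returns an estimate of the distinct count
SELECT APPROX_COUNT_DISTINCT(CustomerID) FROM Sales.Orders;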

Solution

My goal was to test the impact on both query duration and memory footprint for queries using the new function, which trades accuracy for speed by producing a rough estimate instead of an exact count, compared to the more accurate COUNT(DISTINCT()) approach, which forces a read of every single value. I wanted to see how each approach would fare both with batch mode over rowstore (another new enhancement) and without it. I also wanted to compare integers against strings of varying widths and varying fullness (since memory grants for variable-width columns are based on the assumption that they will, on average, be half full). And finally, I wanted to judge the accuracy of the approximation at various ratios of distinct values. So, for a table with 10,000,000 rows, I wanted columns with 100 distinct values, 1,000 distinct values, and so on, up to the subset of columns where all 10,000,000 values would be unique. Building a table to support this was a job in and of itself.
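To isolate the batch mode on rowstore variable, each query can be run twice: once as-is, and once with the same query-level hint the test harness below uses to opt out of batch mode. A minimal sketch, using one of the integer columns from the table built below:

-- eligible for batch mode on rowstore (database compatibility level 150)
SELECT COUNT(DISTINCT i1) FROM dbo.FactTable;

-- same query, forced back to row mode for an apples-to-apples comparison
SELECT COUNT(DISTINCT i1) FROM dbo.FactTable
OPTION (USE HINT('DISALLOW_BATCH_MODE'));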

Here is the script I came up with:

DROP TABLE IF EXISTS dbo.FactTable;
GO

SET NOCOUNT ON;

DECLARE @x8   char(8)   = REPLICATE('@',8),
        @x16  char(16)  = REPLICATE('!',16),
        @x32  char(32)  = REPLICATE('$',32),
        @x64  char(64)  = REPLICATE('&',64),
        @x128 char(128) = REPLICATE('^',128),
        @x256 char(256) = REPLICATE('#',256);

;WITH src(n) AS
(
  SELECT 1 UNION ALL SELECT n + 1 FROM src WHERE n < 100
)
SELECT
  -- integers (rn is unique; i1 .. i5 repeat values to give progressively fewer distinct values)
  rn, i1, i2, i3, i4, i5,
  -- varchar, minimally populated
  vm0 = CONVERT(varchar(8), rn),
  vm1 = CONVERT(varchar(8), i1),
  vm2 = CONVERT(varchar(8), i2),
  vm3 = CONVERT(varchar(8), i3),
  vm4 = CONVERT(varchar(8), i4),
  vm5 = CONVERT(varchar(8), i5),
  -- varchar, half populated
  vh1 = CONVERT(varchar(16),  RIGHT(@x8   + RTRIM(i1),   8)),
  vh2 = CONVERT(varchar(32),  RIGHT(@x16  + RTRIM(i2),  16)),
  vh3 = CONVERT(varchar(64),  RIGHT(@x32  + RTRIM(i3),  32)),
  vh4 = CONVERT(varchar(128), RIGHT(@x64  + RTRIM(i4),  64)),
  vh5 = CONVERT(varchar(256), RIGHT(@x128 + RTRIM(i5), 128)),
  -- varchar, fully populated
  vf0 = CONVERT(varchar(8),   RIGHT(@x8   + RTRIM(rn),   8)),
  vf1 = CONVERT(varchar(16),  RIGHT(@x16  + RTRIM(i1),  16)),
  vf2 = CONVERT(varchar(32),  RIGHT(@x32  + RTRIM(i2),  32)),
  vf3 = CONVERT(varchar(64),  RIGHT(@x64  + RTRIM(i3),  64)),
  vf4 = CONVERT(varchar(128), RIGHT(@x128 + RTRIM(i4), 128)),
  vf5 = CONVERT(varchar(256), RIGHT(@x256 + RTRIM(i5), 256))
INTO dbo.FactTable
FROM
(
  SELECT rn, i1 = rn / 10, i2 = rn / 100, i3 = rn / 1000, i4 = rn / 10000, i5 = n
  FROM
  (
    -- 100 x 100 x 100 x 10 = 10,000,000 rows; rn is 0-based so the divisions yield
    -- 1,000,000 / 100,000 / 10,000 / 1,000 distinct values for i1 .. i4, and i5 has 100
    SELECT s.n, rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
    FROM src AS s
    CROSS JOIN src AS s2
    CROSS JOIN src AS s3
    CROSS JOIN (SELECT TOP (10) n FROM src) AS s4
  ) AS x
) AS y;
GO

CREATE UNIQUE CLUSTERED INDEX cix ON dbo.FactTable(rn);

Have patience, because this took over two minutes on my machine, but it produced a table with 10,000,000 rows that we could now measure queries against. A few sample rows (well, what I could fit on screen, anyway):

SELECT TOP (10) * FROM dbo.FactTable ORDER BY NEWID();
[Screenshot: a sample of rows from dbo.FactTable]
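Before timing anything, it can be worth a quick sanity check that the columns came out with the intended numbers of distinct values. This query is not part of the original harness, just a verification sketch (the expected counts assume the 0-based rn in the build script above):

SELECT
  rn = COUNT(DISTINCT rn),   -- expect 10,000,000
  i1 = COUNT(DISTINCT i1),   -- expect  1,000,000
  i2 = COUNT(DISTINCT i2),   -- expect    100,000
  i3 = COUNT(DISTINCT i3),   -- expect     10,000
  i4 = COUNT(DISTINCT i4),   -- expect      1,000
  i5 = COUNT(DISTINCT i5)    -- expect        100
FROM dbo.FactTable;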

I developed a quick script that loops through each of the columns in the table and inserts the exact and approximate counts into a temp table (the timings are captured separately, as we'll see in a moment):

DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
GO

-- scan the table once so the timed queries below read from the buffer pool
DECLARE @rn int;
SELECT @rn = rn FROM dbo.FactTable;
GO

DROP TABLE IF EXISTS #s;
GO

CREATE TABLE #s(col sysname, cd_rows int, approx_rows int);
GO

SET NOCOUNT ON;
GO

-- keep only the counts from the most recent run of this batch
DELETE #s;

DECLARE @sql nvarchar(max) = N'', @base nvarchar(max) = N'
EXEC sys.sp_executesql N''INSERT #s(col, cd_rows, approx_rows) SELECT ''''$n'''', COUNT(DISTINCT($n)), NULL FROM dbo.FactTable OPTION (USE HINT(''''DISALLOW_BATCH_MODE''''));'';
EXEC sys.sp_executesql N''INSERT #s(col, cd_rows, approx_rows) SELECT ''''$n'''', COUNT(DISTINCT($n)), NULL FROM dbo.FactTable;'';
EXEC sys.sp_executesql N''INSERT #s(col, cd_rows, approx_rows) SELECT ''''$n'''', NULL, APPROX_COUNT_DISTINCT($n) FROM dbo.FactTable OPTION (USE HINT(''''DISALLOW_BATCH_MODE''''));'';
EXEC sys.sp_executesql N''INSERT #s(col, cd_rows, approx_rows) SELECT ''''$n'''', NULL, APPROX_COUNT_DISTINCT($n) FROM dbo.FactTable;'';';

-- build one copy of the four statements for every column except the clustering key
SELECT @sql += REPLACE(@base, N'$n', name)
  FROM sys.columns
  WHERE object_id = OBJECT_ID(N'dbo.FactTable')
  AND name <> N'rn';

EXEC sys.sp_executesql @sql;
-- run this entire batch 5 times
GO 5
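To make the dynamic SQL easier to read, here is what a single pass of @base expands to after the REPLACE, using i5 purely as an example; every column except rn gets the same four statements (exact and approximate, each with and without batch mode):

EXEC sys.sp_executesql N'INSERT #s(col, cd_rows, approx_rows) SELECT ''i5'', COUNT(DISTINCT(i5)), NULL FROM dbo.FactTable OPTION (USE HINT(''DISALLOW_BATCH_MODE''));';
EXEC sys.sp_executesql N'INSERT #s(col, cd_rows, approx_rows) SELECT ''i5'', COUNT(DISTINCT(i5)), NULL FROM dbo.FactTable;';
EXEC sys.sp_executesql N'INSERT #s(col, cd_rows, approx_rows) SELECT ''i5'', NULL, APPROX_COUNT_DISTINCT(i5) FROM dbo.FactTable OPTION (USE HINT(''DISALLOW_BATCH_MODE''));';
EXEC sys.sp_executesql N'INSERT #s(col, cd_rows, approx_rows) SELECT ''i5'', NULL, APPROX_COUNT_DISTINCT(i5) FROM dbo.FactTable;';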

Again, have patience here, since this takes somewhat longer than the initial population (about 5 minutes per batch).

Once we have the most recent measures in the temp table, we can look at those alongside the timings captured for us automatically in the DMV sys.dm_exec_query_stats:

;WITH x AS
(
  SELECT s.col,
    [# of unique values]    = MAX(s.cd_rows),
    [new guess]             = MAX(s.approx_rows),
    [# off]                 = MAX(s.approx_rows) - MAX(s.cd_rows),
    [% off]                 = CASE WHEN MAX(s.approx_rows) < MAX(s.cd_rows) THEN -1 ELSE 1 END
                              * CONVERT(decimal(6,3), MAX(s.approx_rows * 1.0) / MAX(s.cd_rows * 1.0)),
    avg_time_actual         = CONVERT(decimal(20,2), MAX(stat.avg_time_actual)),
    avg_time_guess          = CONVERT(decimal(20,2), MAX(stat.avg_time_guess)),
    avg_time_actual_batch   = CONVERT(decimal(20,2), MAX(stat.avg_time_actual_batch)),
    avg_time_guess_batch    = CONVERT(decimal(20,2), MAX(stat.avg_time_guess_batch)),
    avg_memory_actual       = CONVERT(decimal(20,1), MAX(stat.avg_memory_actual)),
    avg_memory_guess        = CONVERT(decimal(20,1), MAX(stat.avg_memory_guess)),
    avg_memory_actual_batch = CONVERT(decimal(20,1), MAX(stat.avg_memory_actual_batch)),
    avg_memory_guess_batch  = CONVERT(decimal(20,1), MAX(stat.avg_memory_guess_batch))
  FROM #s AS s
  LEFT OUTER JOIN
  (
    SELECT text,
      avg_time_actual         = CASE t WHEN 'actual'       THEN avg_time   END,
      avg_memory_actual       = CASE t WHEN 'actual'       THEN avg_memory END,
      avg_time_guess          = CASE t WHEN 'guess'        THEN avg_time   END,
      avg_memory_guess        = CASE t WHEN 'guess'        THEN avg_memory END,
      avg_time_actual_batch   = CASE t WHEN 'actual_batch' THEN avg_time   END,
      avg_memory_actual_batch = CASE t WHEN 'actual_batch' THEN avg_memory END,
      avg_time_guess_batch    = CASE t WHEN 'guess_batch'  THEN avg_time   END,
      avg_memory_guess_batch  = CASE t WHEN 'guess_batch'  THEN avg_memory END
    FROM
    (
      SELECT t.text,
        t = CASE
              WHEN t.text LIKE N'%COUNT(DISTINCT(%DISALLOW%'            THEN 'actual'
              WHEN t.text LIKE N'%COUNT(DISTINCT(%'                     THEN 'actual_batch'
              WHEN t.text LIKE N'%APPROX_COUNT%dbo.FactTable%DISALLOW%' THEN 'guess'
              -- NOTE: the listing was truncated at this point in the source; everything
              -- from here to the end of the statement is a plausible reconstruction
              WHEN t.text LIKE N'%APPROX_COUNT%dbo.FactTable%'          THEN 'guess_batch'
            END,
        avg_time   = qs.total_elapsed_time * 1.0 / qs.execution_count / 1000, -- microseconds -> ms
        avg_memory = qs.total_grant_kb * 1.0 / qs.execution_count
      FROM sys.dm_exec_query_stats AS qs
      CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
      WHERE t.text LIKE N'INSERT #s%' -- only the statements generated by the test harness
    ) AS y
  ) AS stat
  ON stat.text LIKE N'%''' + s.col + N'''%'
  GROUP BY s.col
)
SELECT * FROM x ORDER BY [# of unique values];
