Columnstore Indexes part 97 (“Working with Strings”)

This is a continuation of the previous 96 parts; the whole series can be found at http://www.nikoport.com/columnstore/ .

After many years of using & fighting the String data type in Columnstore Indexes, I have finally decided to write a blog post that is dedicated to them and shows some of the cases that I have met.

For the queries I will be using the same generated copy of the TPCH database (1GB version) that I have used previously. If you want to have one, use the good old HammerDB (free) or any other software to generate it.

Below you will find the script for restoring the backup of TPCH from C:\Install\:

/*
 * This script restores backup of the TPC-H Database from the C:\Install
 */
USE [master]

if exists(select * from sys.databases where name = 'tpch')
begin
	alter database [tpch]
		set SINGLE_USER WITH ROLLBACK IMMEDIATE;
end

RESTORE DATABASE [tpch]
	FROM DISK = N'C:\Install\tpch_1gb_new.bak' WITH FILE = 1, NOUNLOAD, STATS = 1

alter database [tpch]
	set MULTI_USER;
GO

ALTER DATABASE [tpch] SET COMPATIBILITY_LEVEL = 130
GO
USE [tpch]
GO
EXEC dbo.sp_changedbowner @loginame = N'sa', @map = false
GO

USE [master]
GO
ALTER DATABASE [tpch] MODIFY FILE ( NAME = N'tpch', FILEGROWTH = 256152KB )
GO
ALTER DATABASE [tpch] MODIFY FILE ( NAME = N'tpch_log', SIZE = 1200152KB , FILEGROWTH = 256000KB )

As before, here is the script to create the lineitem_cci table, which is a copy of the original lineitem table that includes a Clustered Columnstore Index:

USE [tpch]
GO

DROP TABLE IF EXISTS dbo.lineitem_cci;

-- Data Loading
SELECT [l_shipdate]
	,[l_orderkey]
	,[l_discount]
	,[l_extendedprice]
	,[l_suppkey]
	,[l_quantity]
	,[l_returnflag]
	,[l_partkey]
	,[l_linestatus]
	,[l_tax]
	,[l_commitdate]
	,[l_receiptdate]
	,[l_shipmode]
	,[l_linenumber]
	,[l_shipinstruct]
	,[l_comment]
	into dbo.lineitem_cci
	FROM [dbo].[lineitem];
GO

-- Create Clustered Columnstore Index
create clustered columnstore index cci_lineitem_cci
	on dbo.lineitem_cci;
GO

String Columns in Fact Tables

Having Strings in Fact tables is quite normal but, to be honest, in most cases it does not make a lot of sense, since we are trying to keep there the information that can be calculated and/or aggregated. Notice that I have written “in most cases” and NOT “in all cases”, because there are some notable exceptions. Additionally, if you are “feeding” SSAS Tabular with your table, it might be much easier to keep the strings directly in it (hey, there is a solution through the views for that, I was told :)).

In this blog post, I am focusing not on the exceptions but on the typical cases where it is not the best option. As a basic solution, I want to present an optimised structure, which contains a tinyint column referring to a new table with the distinct values for the ShipMode.

You can verify that this is doable & makes sense by running the following query:

-- View different ShipModes
SELECT l_shipmode, COUNT(*)
	FROM [dbo].[lineitem_cci]
	GROUP BY l_shipmode
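As a complementary check, the query below is a minimal sketch (not part of the original scripts) that counts the distinct ship modes and the NULL values in one pass. It assumes the lineitem_cci table created above. A tinyint holds values from 0 to 255, so the number of distinct ship modes has to stay well within that range for the substitution to work.

```sql
USE [tpch]
GO

-- Sketch: confirm that the distinct ship modes fit into a tinyint key (0..255)
-- and see whether any NULL ship modes would need special handling.
SELECT COUNT(DISTINCT l_shipmode) AS DistinctShipModes,                        -- NULLs are not counted here
       SUM(CASE WHEN l_shipmode IS NULL THEN 1 ELSE 0 END) AS NullShipModes   -- rows without a ship mode
	FROM [dbo].[lineitem_cci];
```

If NullShipModes comes back greater than zero, those rows need a decision (for example, mapping them to a dedicated “unknown” entry) before the string column is replaced.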

So, back to our restructured table, which I will call lineitem_cci_optimised and which will contain a substitute column l_shipmode_new replacing the original l_shipmode:

CREATE TABLE [dbo].[lineitem_cci_optimised](
	[l_shipdate] [date] NULL,
	[l_orderkey] [bigint] NOT NULL,
	[l_discount] [money] NOT NULL,
	[l_extendedprice] [money] NOT NULL,
	[l_suppkey] [int] NOT NULL,
	[l_quantity] [bigint] NOT NULL,
	[l_returnflag] [char](1) NULL,
	[l_partkey] [bigint] NOT NULL,
	[l_linestatus] [char](1) NULL,
	[l_tax] [money] NOT NULL,
	[l_commitdate] [date] NULL,
	[l_receiptdate] [date] NULL,
	l_shipmode_new tinyint NOT NULL, -- Our new shipmode, stored as a tinyint column
	[l_linenumber] [bigint] NOT NULL,
	[l_shipinstruct] [char](25) NULL,
	[l_comment] [varchar](44) NULL
) ON [PRIMARY]
GO

CREATE CLUSTERED COLUMNSTORE INDEX [cci_lineitem_cci_optimised]
	ON [dbo].[lineitem_cci_optimised]
	WITH (DROP_EXISTING = OFF, COMPRESSION_DELAY = 0) ON [PRIMARY]
GO

Now let’s create the Dimension table for the ShipMode, which I will call DimShipMode and let us fill it with the distinct data from the lineitem_cci table:

DROP TABLE IF EXISTS dbo.DimShipMode;

CREATE TABLE dbo.DimShipMode(
	shipmode_id tinyint IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
	shipmode char(10) NOT NULL );

INSERT INTO dbo.DimShipMode (shipmode)
	VALUES ('UNKNOWN');

INSERT INTO dbo.DimShipMode (shipmode)
	SELECT l_shipmode
		FROM [dbo].[lineitem_cci]
		GROUP BY l_shipmode;

INSERT INTO dbo.lineitem_cci_optimised
	(l_shipdate, l_orderkey, l_discount, l_extendedprice, l_suppkey, l_quantity, l_returnflag, l_partkey, l_linestatus, l_tax, l_commitdate, l_receiptdate, l_shipmode_new, l_linenumber, l_shipinstruct, l_comment)
SELECT [l_shipdate]
	,[l_orderkey]
	,[l_discount]
	,[l_extendedprice]
	,[l_suppkey]
	,[l_quantity]
	,[l_returnflag]
	,[l_partkey]
	,[l_linestatus]
	,[l_tax]
	,[l_commitdate]
	,[l_receiptdate]
	,m.shipmode_id -- This is our ShipMode, but stored as a tinyint
	,[l_linenumber]
	,[l_shipinstruct]
	,[l_comment]
	FROM [tpch].[dbo].[lineitem_cci] c
	INNER JOIN dbo.DimShipMode m
		ON c.l_shipmode = m.shipmode;

alter table [dbo].[lineitem_cci_optimised]
	WITH CHECK ADD CONSTRAINT [fk_lineitem_cci_optimised_shipmode]
	FOREIGN KEY([l_shipmode_new])
	REFERENCES [dbo].DimShipMode (shipmode_id);
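Because the load uses an INNER JOIN on the ship mode, any source row without a matching entry in DimShipMode (a NULL l_shipmode, for example) would silently not be copied. The following is a small sanity check, sketched under the assumption that the three tables above exist as created; it is not part of the original scripts.

```sql
USE [tpch]
GO

-- Rows in the source and in the optimised copy; the two counts should match.
SELECT
	(SELECT COUNT_BIG(*) FROM dbo.lineitem_cci) AS SourceRows,
	(SELECT COUNT_BIG(*) FROM dbo.lineitem_cci_optimised) AS OptimisedRows;

-- Source rows with no match in DimShipMode (NULL ship modes included),
-- i.e. rows that the INNER JOIN would have left out of the optimised table.
SELECT COUNT_BIG(*) AS UnmatchedRows
	FROM dbo.lineitem_cci c
	WHERE NOT EXISTS (SELECT 1
						FROM dbo.DimShipMode m
						WHERE m.shipmode = c.l_shipmode);
```

If UnmatchedRows is not zero, one option would be to load those rows with the shipmode_id of the 'UNKNOWN' entry, so that both tables stay comparable.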

Let’s run 2 queries comparing the performance of these 2 solutions (the original one with just 1 table and the modified one with the Dimension table):

SET STATISTICS TIME, IO ON

SELECT TOP 3 l_shipinstruct
	, SUM(l_quantity) as Quantity
	, COUNT(distinct l_orderkey) as DistinctOrders
	, MAX(l_comment) as BiggestComment
	FROM [dbo].[lineitem_cci]
	WHERE ISNULL(l_shipmode,'RAIL') = 'RAIL'
		and l_comment not like 'furiosly%'
	GROUP BY l_shipinstruct;

SELECT TOP 3 l_shipinstruct
	, SUM(l_quantity) as Quantity
	, COUNT(distinct l_orderkey) as DistinctOrders5
	, MAX(l_comment) as BiggestComment
	FROM [dbo].[lineitem_cci_optimised] cci
	INNER JOIN dbo.DimShipMode ship
		ON cci.l_shipmode_new = ship.shipmode_id
	WHERE ISNULL(ship.shipmode,'RAIL') = 'RAIL'
		and l_comment not like 'furiosly%'
	GROUP BY l_shipinstruct;

While the first query takes 11.5 seconds on my Virtual Machine, the second one takes only 4.5 seconds! This is a huge difference, which is happening partly due to some other string limitations that naturally do not apply to the dimension table, since it is a regular Rowstore table.

Notice that the difference was just over 2 times only because I was running the queries on the Developer Edition (same features as in the Enterprise Edition) of SQL Server 2016. On the Standard & Express Editions the string predicate pushdown for some parts of such queries will not work, making the first query perform even slower in comparison.

Take a look at the execution plans:

A lot of the problems of the first query are easy to expose: almost 6 million rows are being extracted from the Columnstore Index Scan, the estimation for the output rows of the Filter iterator is wrong, and the Sort iterator has to spill to TempDB.

The second query does not suffer from these problems and performs reasonably well, at least doubling the performance of the original query.
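For consumers that still expect the original column layout (the SSAS Tabular case and the view-based solution mentioned earlier), the join can be hidden behind a view. This is only a sketch: the view name vw_lineitem_cci is hypothetical and not part of the original scripts, and it assumes the lineitem_cci_optimised and DimShipMode tables created above.

```sql
USE [tpch]
GO

-- Hypothetical view name, chosen for illustration only.
DROP VIEW IF EXISTS dbo.vw_lineitem_cci;
GO

-- Exposes the optimised table with the original l_shipmode string column,
-- resolved through the DimShipMode dimension table.
CREATE VIEW dbo.vw_lineitem_cci
AS
SELECT cci.[l_shipdate]
	,cci.[l_orderkey]
	,cci.[l_discount]
	,cci.[l_extendedprice]
	,cci.[l_suppkey]
	,cci.[l_quantity]
	,cci.[l_returnflag]
	,cci.[l_partkey]
	,cci.[l_linestatus]
	,cci.[l_tax]
	,cci.[l_commitdate]
	,cci.[l_receiptdate]
	,ship.shipmode as [l_shipmode]
	,cci.[l_linenumber]
	,cci.[l_shipinstruct]
	,cci.[l_comment]
	FROM [dbo].[lineitem_cci_optimised] cci
	INNER JOIN dbo.DimShipMode ship
		ON cci.l_shipmode_new = ship.shipmode_id;
GO
```

Since the foreign key was added WITH CHECK and l_shipmode_new is NOT NULL, the Query Optimiser may be able to skip the join entirely for queries against the view that never touch l_shipmode, though this is worth confirming in the execution plan.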

No Row Group elimination

This is one of the huge items in the Columnstore Indexes: the inability of the string data types (char,varchar,varc
