Columnstore Indexes part 97 (“Working with Strings”)

This is a continuation of the previous 96 parts; the whole series can be found at http://www.nikoport.com/columnstore/.

After many years of using and fighting the String data type in Columnstore Indexes, I have finally decided to write a blog post dedicated to it, showing some of the cases that I have met.

For the queries I will be using the same generated copy of the TPC-H database (1 GB version) that I have used previously. If you want to have one, use the good old HammerDB (free) or any other software to generate it.

Below you will find the script for restoring the backup of TPC-H from the C:\Install\ folder:

/*
 * This script restores backup of the TPC-H Database from the C:\Install
 */
USE [master]
IF EXISTS (SELECT * FROM sys.databases WHERE name = 'tpch')
BEGIN
    ALTER DATABASE [tpch]
        SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
END
RESTORE DATABASE [tpch]
    FROM DISK = N'C:\Install\tpch_1gb_new.bak' WITH FILE = 1, NOUNLOAD, STATS = 1
ALTER DATABASE [tpch]
    SET MULTI_USER;
GO
ALTER DATABASE [tpch] SET COMPATIBILITY_LEVEL = 130
GO
USE [tpch]
GO
EXEC dbo.sp_changedbowner @loginame = N'sa', @map = false
GO
USE [master]
GO
ALTER DATABASE [tpch] MODIFY FILE ( NAME = N'tpch', FILEGROWTH = 256152KB )
GO
ALTER DATABASE [tpch] MODIFY FILE ( NAME = N'tpch_log', SIZE = 1200152KB, FILEGROWTH = 256000KB )

As before, here is the script to create the lineitem_cci table, a copy of the original lineitem table that includes a Clustered Columnstore Index:

USE [tpch]
GO
DROP TABLE IF EXISTS dbo.lineitem_cci;
-- Data Loading
SELECT [l_shipdate]
      ,[l_orderkey]
      ,[l_discount]
      ,[l_extendedprice]
      ,[l_suppkey]
      ,[l_quantity]
      ,[l_returnflag]
      ,[l_partkey]
      ,[l_linestatus]
      ,[l_tax]
      ,[l_commitdate]
      ,[l_receiptdate]
      ,[l_shipmode]
      ,[l_linenumber]
      ,[l_shipinstruct]
      ,[l_comment]
  INTO dbo.lineitem_cci
  FROM [dbo].[lineitem];
GO
-- Create Clustered Columnstore Index
CREATE CLUSTERED COLUMNSTORE INDEX cci_lineitem_cci
    ON dbo.lineitem_cci;
GO

String Columns in Fact Tables

Having strings in fact tables is quite normal but, to be honest, in most cases it does not make a lot of sense, since a fact table is where we keep the information that can be calculated and/or aggregated. Notice that I have written “in most cases” and NOT “in all cases”, because there are some notable exceptions. Additionally, if you are “feeding” SSAS Tabular from your table, it might be much easier to do it directly (hey, there is a solution through the views for that, I was told :)).

In this blog post I am focusing not on the exceptions but on the typical cases, where a string column is not the best option. As a basic solution, I want to present an optimised structure, which contains a tinyint column referring to a new table with the distinct values of the ShipMode.

You can ensure that this is doable and makes sense by running the following query:

-- View different ShipModes
SELECT l_shipmode, COUNT(*)
  FROM [dbo].[lineitem_cci]
  GROUP BY l_shipmode
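As a complementary check, here is a minimal sketch (not part of the original scripts) that counts the distinct ship modes and the NULL values directly. The column aliases are my own. A tinyint holds values from 0 to 255, so the distinct count has to stay well below that for the substitution to work.

```sql
-- Sketch: confirm that l_shipmode has few enough distinct values for a tinyint key.
-- COUNT(DISTINCT ...) ignores NULLs, so they are counted separately.
SELECT COUNT(DISTINCT l_shipmode) AS DistinctShipModes,
       SUM(CASE WHEN l_shipmode IS NULL THEN 1 ELSE 0 END) AS NullShipModes
  FROM [dbo].[lineitem_cci];
```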

So, back to our restructured table, which I will call lineitem_cci_optimised and which will contain a substitute column l_shipmode_new, replacing the original l_shipmode:

CREATE TABLE [dbo].[lineitem_cci_optimised](
    [l_shipdate] [date] NULL,
    [l_orderkey] [bigint] NOT NULL,
    [l_discount] [money] NOT NULL,
    [l_extendedprice] [money] NOT NULL,
    [l_suppkey] [int] NOT NULL,
    [l_quantity] [bigint] NOT NULL,
    [l_returnflag] [char](1) NULL,
    [l_partkey] [bigint] NOT NULL,
    [l_linestatus] [char](1) NULL,
    [l_tax] [money] NOT NULL,
    [l_commitdate] [date] NULL,
    [l_receiptdate] [date] NULL,
    l_shipmode_new tinyint NOT NULL, -- Our new shipmode, stored as a tinyint column
    [l_linenumber] [bigint] NOT NULL,
    [l_shipinstruct] [char](25) NULL,
    [l_comment] [varchar](44) NULL
) ON [PRIMARY]
GO
CREATE CLUSTERED COLUMNSTORE INDEX [cci_lineitem_cci_optimised]
    ON [dbo].[lineitem_cci_optimised]
    WITH (DROP_EXISTING = OFF, COMPRESSION_DELAY = 0) ON [PRIMARY]
GO

Now let’s create the Dimension table for the ShipMode, which I will call DimShipMode and let us fill it with the distinct data from the lineitem_cci table:

DROP TABLE IF EXISTS dbo.DimShipMode;
CREATE TABLE dbo.DimShipMode(
    shipmode_id tinyint IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    shipmode char(10) NOT NULL
);
INSERT INTO dbo.DimShipMode (shipmode)
    VALUES ('UNKNOWN');
INSERT INTO dbo.DimShipMode (shipmode)
    SELECT l_shipmode
      FROM [dbo].[lineitem_cci]
      GROUP BY l_shipmode;
INSERT INTO dbo.lineitem_cci_optimised
    (l_shipdate, l_orderkey, l_discount, l_extendedprice, l_suppkey, l_quantity, l_returnflag, l_partkey, l_linestatus, l_tax, l_commitdate, l_receiptdate, l_shipmode_new, l_linenumber, l_shipinstruct, l_comment)
SELECT [l_shipdate]
      ,[l_orderkey]
      ,[l_discount]
      ,[l_extendedprice]
      ,[l_suppkey]
      ,[l_quantity]
      ,[l_returnflag]
      ,[l_partkey]
      ,[l_linestatus]
      ,[l_tax]
      ,[l_commitdate]
      ,[l_receiptdate]
      ,m.shipmode_id -- This is our ShipMode, but stored as a tinyint
      ,[l_linenumber]
      ,[l_shipinstruct]
      ,[l_comment]
  FROM [tpch].[dbo].[lineitem_cci] c
  INNER JOIN dbo.DimShipMode m
      ON c.l_shipmode = m.shipmode;
ALTER TABLE [dbo].[lineitem_cci_optimised] WITH CHECK
    ADD CONSTRAINT [fk_lineitem_cci_optimised_shipmode]
    FOREIGN KEY ([l_shipmode_new]) REFERENCES [dbo].DimShipMode (shipmode_id);
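Before comparing performance, it may be worth confirming that both tables hold the same number of rows. The load uses an INNER JOIN on the ship mode, and an equality comparison never matches NULL, so any rows with a NULL l_shipmode would not be copied. The following is a minimal sketch of such a check; the column aliases are my own and it is not part of the original scripts.

```sql
-- Sketch: sanity check of the load into lineitem_cci_optimised.
-- OriginalRows and OptimisedRows should match if no ship mode is NULL.
SELECT
    (SELECT COUNT_BIG(*) FROM dbo.lineitem_cci) AS OriginalRows,
    (SELECT COUNT_BIG(*) FROM dbo.lineitem_cci_optimised) AS OptimisedRows,
    (SELECT COUNT_BIG(*) FROM dbo.lineitem_cci WHERE l_shipmode IS NULL) AS NullShipModeRows;
```

If NullShipModeRows is greater than zero, those rows could be mapped to the 'UNKNOWN' member of DimShipMode instead, which I assume is the purpose of that entry.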

Let’s run 2 queries comparing the performance of these 2 solutions (the original one with just 1 table and the modified one with the Dimension table):

SET STATISTICS TIME, IO ON
SELECT TOP 3 l_shipinstruct
    , SUM(l_quantity) AS Quantity
    , COUNT(DISTINCT l_orderkey) AS DistinctOrders
    , MAX(l_comment) AS BiggestComment
  FROM [dbo].[lineitem_cci]
  WHERE ISNULL(l_shipmode,'RAIL') = 'RAIL' AND l_comment NOT LIKE 'furiosly%'
  GROUP BY l_shipinstruct;
SELECT TOP 3 l_shipinstruct
    , SUM(l_quantity) AS Quantity
    , COUNT(DISTINCT l_orderkey) AS DistinctOrders5
    , MAX(l_comment) AS BiggestComment
  FROM [dbo].[lineitem_cci_optimised] cci
  INNER JOIN dbo.DimShipMode ship
      ON cci.l_shipmode_new = ship.shipmode_id
  WHERE ISNULL(ship.shipmode,'RAIL') = 'RAIL' AND l_comment NOT LIKE 'furiosly%'
  GROUP BY l_shipinstruct;

While the first query takes 11.5 seconds on my Virtual Machine, the second one takes only 4.5 seconds! This is a huge difference, which happens partly due to some other string limitations that naturally do not apply to the dimension table, which is a regular Rowstore table.

Notice that the difference was just over 2 times because I was running the query on the Developer Edition (same features as in the Enterprise Edition) of SQL Server 2016. On the Standard and Express editions the string predicate pushdown for some parts of such queries will not work, making the first query perform even slower in comparison.
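To look at how the two tables store the ship mode internally, one option is to query the columnstore dictionaries. This is only a sketch under assumptions: it relies on the sys.column_store_dictionaries catalog view and assumes that its column_id matches sys.columns for a Clustered Columnstore Index on a table with no dropped columns. I am not making any claim here about the numbers it returns.

```sql
-- Sketch: dictionaries behind the original string column and its tinyint substitute.
-- A column with no dictionary simply returns no rows.
SELECT OBJECT_NAME(p.object_id) AS TableName,
       c.name AS ColumnName,
       d.dictionary_id,
       d.entry_count,
       d.on_disk_size
  FROM sys.column_store_dictionaries d
  INNER JOIN sys.partitions p
      ON d.hobt_id = p.hobt_id
  INNER JOIN sys.columns c
      ON c.object_id = p.object_id AND c.column_id = d.column_id
  WHERE p.object_id IN (OBJECT_ID('dbo.lineitem_cci'), OBJECT_ID('dbo.lineitem_cci_optimised'))
    AND c.name IN ('l_shipmode', 'l_shipmode_new');
```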

Take a look at the execution plans:


[Image: execution plans of the two queries]

A lot of the problems of the first query are easy to expose: almost 6 million rows are being extracted from the Columnstore Index Scan, the estimate for the output rows of the Filter iterator is wrong, and the Sort iterator has to spill to TempDB.

The second query does not suffer from these problems and performs reasonably well, at least doubling the performance of the original query.
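If the consumers of the table should not have to write the join themselves, the optimised structure can be wrapped in a view that exposes the ship mode under its original column name. This is a minimal sketch; the view name vLineitemShipMode is hypothetical and not part of the original scripts.

```sql
USE [tpch]
GO
DROP VIEW IF EXISTS dbo.vLineitemShipMode; -- hypothetical view name
GO
-- Sketch: present the optimised table with the original l_shipmode column
CREATE VIEW dbo.vLineitemShipMode
AS
SELECT cci.l_shipdate
      ,cci.l_orderkey
      ,cci.l_discount
      ,cci.l_extendedprice
      ,cci.l_suppkey
      ,cci.l_quantity
      ,cci.l_returnflag
      ,cci.l_partkey
      ,cci.l_linestatus
      ,cci.l_tax
      ,cci.l_commitdate
      ,cci.l_receiptdate
      ,ship.shipmode AS l_shipmode -- string value looked up from the dimension
      ,cci.l_linenumber
      ,cci.l_shipinstruct
      ,cci.l_comment
  FROM dbo.lineitem_cci_optimised cci
  INNER JOIN dbo.DimShipMode ship
      ON cci.l_shipmode_new = ship.shipmode_id;
GO
```

Queries against the view still pay for the join, so the timings should be measured again rather than assumed to match the second query above.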

No Row Group elimination

This is one of the huge items in the Columnstore Indexes: the inability of the string data types (char,varchar,varc
