Continuation from the previous 98 parts, the whole series can be found at http://www.nikoport.com/columnstore/ .
This blog post is focused on the MERGE statement for Columnstore Indexes, or as I call it, the worst enemy of the Columnstore Indexes. It is extremely difficult to imagine a statement better suited for making the worst out of Columnstore Indexes than the infamous MERGE statement. Why? Because it does not merely make Columnstore Indexes perform slowly, it makes them perform MUCH SLOWER than any Rowstore Indexes. Yes, you have read that right: slower than ANY_ROWSTORE_INDEXES. In fact, that reads like a query hint that gets applied to the MERGE statement whenever it is executed against Columnstore Indexes!
I decided to dedicate a whole blog post to this matter, mainly to warn people about this pretty problematic statement. I hope not to see it being used with Columnstore Indexes in the future!
The MERGE T-SQL statement has a huge number of bugs and potential problems, with some statements delivering incorrect results or being cancelled. For the details, I recommend that you read Use Caution with SQL Server’s MERGE Statement, where Aaron Bertrand went into the details of why one should strive to avoid using this statement.
You might point out that a couple of years ago I already published a blog post on the dangers of using the UPDATE statement with Columnstore Indexes, but as I keep seeing the MERGE statement on production servers, it clearly deserves its own post.
As in the other blog post, I will be using a generated copy of the TPCH database (the 10GB version this time, because I want my tests to reflect bigger workloads), which I have generated with HammerDB (free software).
Below you will find the script for restoring the backup of TPCH from the C:\Install\ folder:
USE [master]
GO
IF EXISTS (SELECT * FROM sys.databases WHERE name = 'tpch')
BEGIN
    ALTER DATABASE [tpch] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
END
RESTORE DATABASE [tpch]
    FROM DISK = N'C:\Install\tpch_10gb.bak'
    WITH FILE = 1, NOUNLOAD, STATS = 1;
ALTER DATABASE [tpch] SET MULTI_USER;
GO
ALTER DATABASE [tpch] SET COMPATIBILITY_LEVEL = 130
GO
USE [tpch]
GO
EXEC dbo.sp_changedbowner @loginame = N'sa', @map = false
GO
USE [master]
GO
ALTER DATABASE [tpch] MODIFY FILE ( NAME = N'tpch', FILEGROWTH = 2561520KB )
GO
ALTER DATABASE [tpch] MODIFY FILE ( NAME = N'tpch_log', SIZE = 1200152KB, FILEGROWTH = 256000KB )
GO

I will be showing the data loading on the sample table dbo.lineitem, and as previously, here is the setup script for creating the Clustered Columnstore Index on it:
USE [tpch]
GO
DROP TABLE IF EXISTS dbo.lineitem_cci;

-- Data Loading
SELECT [l_shipdate]
    ,[l_orderkey]
    ,[l_discount]
    ,[l_extendedprice]
    ,[l_suppkey]
    ,[l_quantity]
    ,[l_returnflag]
    ,[l_partkey]
    ,[l_linestatus]
    ,[l_tax]
    ,[l_commitdate]
    ,[l_receiptdate]
    ,[l_shipmode]
    ,[l_linenumber]
    ,[l_shipinstruct]
    ,[l_comment]
    INTO dbo.lineitem_cci
    FROM [dbo].[lineitem];

-- Create Clustered Columnstore Index
CREATE CLUSTERED COLUMNSTORE INDEX cci_lineitem_cci
    ON dbo.lineitem_cci;

Now we have a sweet 60 million rows within our lineitem table.
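If you want to double-check the freshly built index, here is a small query (my own addition, assuming SQL Server 2016 or later, where the sys.dm_db_column_store_row_group_physical_stats DMV is available) that shows the row groups and their states:

-- Inspect the row groups of the new Clustered Columnstore Index
SELECT rg.row_group_id,
    rg.state_desc,
    rg.total_rows,
    rg.deleted_rows
    FROM sys.dm_db_column_store_row_group_physical_stats rg
    WHERE rg.object_id = OBJECT_ID('dbo.lineitem_cci')
    ORDER BY rg.row_group_id;

On a healthy freshly built index you should see compressed row groups with zero deleted rows.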
For the data loading test, I will be using the following staging table:
USE [tpch]
GO
DROP TABLE IF EXISTS [dbo].[lineitem_cci_stage];

CREATE TABLE [dbo].[lineitem_cci_stage](
    [l_shipdate] [date] NULL,
    [l_orderkey] [bigint] NOT NULL,
    [l_discount] [money] NOT NULL,
    [l_extendedprice] [money] NOT NULL,
    [l_suppkey] [int] NOT NULL,
    [l_quantity] [bigint] NOT NULL,
    [l_returnflag] [char](1) NULL,
    [l_partkey] [bigint] NOT NULL,
    [l_linestatus] [char](1) NULL,
    [l_tax] [money] NOT NULL,
    [l_commitdate] [date] NULL,
    [l_receiptdate] [date] NULL,
    [l_shipmode] [char](10) NULL,
    [l_linenumber] [bigint] NOT NULL,
    [l_shipinstruct] [char](25) NULL,
    [l_comment] [varchar](44) NULL
) ON [PRIMARY]
GO

CREATE CLUSTERED COLUMNSTORE INDEX [cci_lineitem_cci_stage]
    ON [dbo].[lineitem_cci_stage];
GO

For the test, I decided to load from the staging table and update our lineitem_cci table with 1 million and with 5 million rows.
First of all, let us start with extracting 1 million rows and putting them into the staging table:
TRUNCATE TABLE [dbo].[lineitem_cci_stage];

INSERT INTO [dbo].[lineitem_cci_stage] WITH (TABLOCK)
SELECT TOP 1000000
    l_shipdate, l_orderkey, l_discount, l_extendedprice, l_suppkey, l_quantity, l_returnflag, l_partkey, l_linestatus, l_tax, l_commitdate, l_receiptdate, l_shipmode, l_linenumber, l_shipinstruct, l_comment
    FROM dbo.lineitem_cci;

Let’s MERGE the data from the Stage table into our main one, like a lot of people are doing. Surely this statement won’t hurt my test VM, running on the Azure Standard DS12 v2 (4 cores, 28 GB memory):
MERGE INTO [dbo].[lineitem_cci] AS [Target]
USING (
    SELECT l_shipdate, l_orderkey, l_discount, l_extendedprice, l_suppkey, l_quantity, l_returnflag, l_partkey, l_linestatus, l_tax, l_commitdate, l_receiptdate, l_shipmode, l_linenumber, l_shipinstruct, l_comment
        FROM dbo.lineitem_cci_stage
) AS [Source]
ON Target.L_ORDERKEY = Source.L_ORDERKEY
    AND Target.L_LINENUMBER = Source.L_LINENUMBER
WHEN MATCHED THEN
    UPDATE SET l_shipdate = Source.l_shipdate,
        l_orderkey = Source.l_orderkey,
        l_discount = Source.l_discount,
        l_extendedprice = Source.l_extendedprice,
        l_suppkey = Source.l_suppkey,
        l_quantity = Source.l_quantity,
        l_returnflag = Source.l_returnflag,
        l_partkey = Source.l_partkey,
        l_linestatus = Source.l_linestatus,
        l_tax = Source.l_tax,
        l_commitdate = Source.l_commitdate,
        l_receiptdate = Source.l_receiptdate,
        l_shipmode = Source.l_shipmode,
        l_linenumber = Source.l_linenumber,
        l_shipinstruct = Source.l_shipinstruct,
        l_comment = Source.l_comment
WHEN NOT MATCHED BY TARGET THEN
    INSERT (l_shipdate, l_orderkey, l_discount, l_extendedprice, l_suppkey, l_quantity, l_returnflag, l_partkey, l_linestatus, l_tax, l_commitdate, l_receiptdate, l_shipmode, l_linenumber, l_shipinstruct, l_comment)
    VALUES (Source.l_shipdate, Source.l_orderkey, Source.l_discount, Source.l_extendedprice, Source.l_suppkey, Source.l_quantity, Source.l_returnflag, Source.l_partkey, Source.l_linestatus, Source.l_tax, Source.l_commitdate, Source.l_receiptdate, Source.l_shipmode, Source.l_linenumber, Source.l_shipinstruct, Source.l_comment);

It took 27.3 seconds on average to execute this operation, with 29.5 seconds of CPU time burnt (notice again, I am not even talking here about transactions being cancelled because of repeatedly touched rows, or about any of the other bugs and inconsistencies).
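Before judging that number, it is worth remembering what an UPDATE against a compressed row group does internally: each updated row is flagged in the delete bitmap and re-inserted. Here is a minimal sketch (my own addition, assuming SQL Server 2016+) to inspect the delete bitmap right after the MERGE:

-- How many rows the MERGE flagged in the delete bitmap (illustrative check)
SELECT SUM(rg.deleted_rows) AS total_deleted_rows,
    SUM(rg.total_rows) AS total_rows
    FROM sys.dm_db_column_store_row_group_physical_stats rg
    WHERE rg.object_id = OBJECT_ID('dbo.lineitem_cci');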
Is it good?
Is it bad?
Let us find out by running the same data loading procedure, but this time with the help of DELETE & INSERT statements:
DELETE Target
    FROM dbo.lineitem_cci AS Target
    WHERE EXISTS (
        SELECT 1
            FROM dbo.lineitem_cci_stage AS Source
            WHERE Target.L_ORDERKEY = Source.L_ORDERKEY
                AND Target.L_LINENUMBER = Source.L_LINENUMBER
    );

INSERT INTO dbo.lineitem_cci WITH (TABLOCK)
SELECT l_shipdate, l_orderkey, l_discount, l_extendedprice, l_suppkey, l_quantity, l_returnflag, l_partkey, l_linestatus, l_tax, l_commitdate, l_receiptdate, l_shipmode, l_linenumber, l_shipinstruct, l_comment
    FROM [dbo].[lineitem_cci_stage];
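One caveat worth noting: unlike MERGE, which is a single atomic statement, the DELETE and the INSERT above run as two separate transactions. If you need the same all-or-nothing behaviour, a minimal sketch (my own addition for illustration) would simply wrap both statements into one explicit transaction:

-- Making the DELETE & INSERT pair atomic, as a single MERGE statement would be
BEGIN TRANSACTION;

DELETE Target
    FROM dbo.lineitem_cci AS Target
    WHERE EXISTS ( SELECT 1
            FROM dbo.lineitem_cci_stage AS Source
            WHERE Target.L_ORDERKEY = Source.L_ORDERKEY
                AND Target.L_LINENUMBER = Source.L_LINENUMBER );

INSERT INTO dbo.lineitem_cci WITH (TABLOCK)
SELECT l_shipdate, l_orderkey, l_discount, l_extendedprice, l_suppkey, l_quantity, l_returnflag, l_partkey, l_linestatus, l_tax, l_commitdate, l_receiptdate, l_shipmode, l_linenumber, l_shipinstruct, l_comment
    FROM [dbo].[lineitem_cci_stage];

COMMIT TRANSACTION;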