
How-to load data fast into SQL Server 2016


What is the absolute fastest way to load data from a flat file into a table in SQL Server 2016? A lot has changed since my initial post on this topic many years ago, like the introduction of In-Memory Optimized tables and updateable Columnstore table indexes. The list of data transport vehicles to choose from is also growing: besides BCP, the T-SQL BULK INSERT command, SSIS as ETL tool and PowerShell, some new ones have been added, like PolyBase, External R Script or ADF. In this post I will start by checking how much faster the new durable and non-durable In-Memory tables are!


Setting the Baseline

For these tests I'm using an Azure DS4_V2 Standard VM with 8 cores/28 GB of RAM and 2 HDD volumes with host caching R/W enabled. (Both LUNs provide ~275 MB/sec R/W throughput, although the GUI states a limit of 60 MB/sec.) As data to load I generated a single ~60 million row/7.2 GB TPC-H lineitem flat file.

As a baseline for comparison we will use the time it takes to load the file into a heap table:

BULK INSERT [LINEITEM_HEAP]
FROM 'F:\TPCH\lineitem.tbl'
WITH ( FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' );

This regular Bulk Insert command completes within 7 minutes with an average of 143K rows/sec.
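(The BULK INSERT above assumes the LINEITEM_HEAP table already exists. A minimal sketch of such a heap, assuming the standard TPC-H lineitem columns; the data types below are my own choice and not from the original post:

CREATE TABLE dbo.LINEITEM_HEAP
(
    L_ORDERKEY      BIGINT        NOT NULL,
    L_PARTKEY       INT           NOT NULL,
    L_SUPPKEY       INT           NOT NULL,
    L_LINENUMBER    INT           NOT NULL,
    L_QUANTITY      DECIMAL(15,2) NOT NULL,
    L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
    L_DISCOUNT      DECIMAL(15,2) NOT NULL,
    L_TAX           DECIMAL(15,2) NOT NULL,
    L_RETURNFLAG    CHAR(1)       NOT NULL,
    L_LINESTATUS    CHAR(1)       NOT NULL,
    L_SHIPDATE      SMALLDATETIME NOT NULL,
    L_COMMITDATE    SMALLDATETIME NOT NULL,
    L_RECEIPTDATE   SMALLDATETIME NOT NULL,
    L_SHIPINSTRUCT  CHAR(25)      NOT NULL,
    L_SHIPMODE      CHAR(10)      NOT NULL,
    L_COMMENT       VARCHAR(44)   NOT NULL
);  -- no indexes defined: a heap
)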

Enabling the test database for Memory optimized tables

In-memory tables, introduced in SQL Server 2014 and 2016 (Enterprise & Developer Edition), are designed for very fast OLTP with many small transactions and high concurrency, which is a completely different type of workload than bulk inserting… but, just out of curiosity… let's give it a try!

There are 2 types of in-memory tables: durable and non-durable. Durable tables persist data on disk, non-durable ones don't. To enable this option we have to do some housekeeping and assign a fast disk volume for hosting these files.
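Purely as an illustration (these demo tables are not part of the test), the distinction is expressed via the DURABILITY option in the table DDL:

-- Illustrative only: the DURABILITY option distinguishes the two table types.
CREATE TABLE dbo.Demo_Durable
(
    Id INT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);  -- data survives a restart

CREATE TABLE dbo.Demo_NonDurable
(
    Id INT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);      -- only the schema survives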

First, alter the database to add a filegroup with the 'CONTAINS MEMORY_OPTIMIZED_DATA' option, followed by adding a file location to that filegroup, which will contain the memory-optimized tables:

ALTER DATABASE [TicketReservations]
ADD FILEGROUP [memory_optimized_filegroup_0] CONTAINS MEMORY_OPTIMIZED_DATA
GO
ALTER DATABASE [TicketReservations]
ADD FILE ( NAME = N'memory_optimized_file',
           FILENAME = N'E:\data\memory_optimized_files' )
TO FILEGROUP [memory_optimized_filegroup_0]
GO

The third thing to do is to add a separate memory pool to the SQL Server instance so it can keep all the data we will load into in-memory tables separate from its default memory pool:

Binding a database to a memory pool

The steps to define a separate memory pool and to bind a database to it are listed below:

Extra memory pools are managed via the SQL Resource Governor.

CREATE RESOURCE POOL [PoolHk]
WITH ( MIN_CPU_PERCENT = 0,
       MAX_CPU_PERCENT = 100,
       MIN_MEMORY_PERCENT = 0,
       MAX_MEMORY_PERCENT = 70,
       AFFINITY SCHEDULER = AUTO );

-- Dynamically change the value of MAX_MEMORY_PERCENT:
ALTER RESOURCE POOL PoolHk WITH (MAX_MEMORY_PERCENT = 80);

-- RECONFIGURE enables the Resource Governor:
ALTER RESOURCE GOVERNOR RECONFIGURE;
ALTER RESOURCE GOVERNOR RESET STATISTICS;
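A quick sanity check that the Resource Governor is indeed enabled after the RECONFIGURE:

-- Returns 1 when the Resource Governor is enabled
SELECT is_enabled
FROM sys.resource_governor_configuration;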

The fourth and last step is to bind the test database to the new memory pool with the sys.sp_xtp_bind_db_resource_pool command. For the binding to become effective we have to take the database offline and bring it back online. Once bound, we can dynamically change the amount of memory assigned to the pool via the 'ALTER RESOURCE POOL PoolHk WITH (MAX_MEMORY_PERCENT = 80)' command.

-- Bind the database to the pool:
EXEC sys.sp_xtp_bind_db_resource_pool N'TicketReservations', N'PoolHk';

-- Take the database offline:
ALTER DATABASE [TicketReservations] SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- Bring the database back online:
ALTER DATABASE [TicketReservations] SET ONLINE;

-- Deleting the resource pool:
-- EXEC sys.sp_xtp_unbind_db_resource_pool N'TicketReservations';
-- DROP RESOURCE POOL [PoolHk];

Bulk Insert into Durable In-Memory table

Now that we are all set with the in-memory option enabled, we can create an in-memory table. Every memory-optimized table must have at least one index (either a range or a hash index), which is completely (re)built in memory and never stored on disk.
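As an aside, a minimal sketch of the two index kinds on a memory-optimized table (illustrative table, not part of the test):

-- Illustrative only: a hash index suits point lookups,
-- a (range) nonclustered index suits range scans.
CREATE TABLE dbo.Demo_Indexes
(
    Id       INT           NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000), -- hash index
    ShipDate SMALLDATETIME NOT NULL,
    INDEX ix_ShipDate NONCLUSTERED (ShipDate)                        -- range index
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);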

A durable table must have a declared primary key, which can be supported by the required index. To support a primary key I added an extra ROWID1 row-number column to the table:

----- Durable table:
CREATE TABLE [dbo].[LINEITEM_MO_Final]
(
    [ROWID1]     [bigint]        NOT NULL,
    [L_SHIPDATE] [smalldatetime] NOT NULL,
    ..
    [L_COMMENT]  [varchar](44) COLLATE Latin1_General_100_BIN2 NOT NULL,
    CONSTRAINT RowID_key PRIMARY KEY NONCLUSTERED HASH (ROWID1)
        WITH (BUCKET_COUNT = 10000000)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
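The flat file therefore has to supply values for this extra column. The post doesn't show how lineitem_export.csv was created; one hypothetical way is to number the rows of the heap baseline table when exporting:

-- Hypothetical sketch: generate ROWID1 values by numbering the heap rows,
-- e.g. as the source query for a bcp queryout export.
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ROWID1,
       h.*
FROM dbo.LINEITEM_HEAP AS h;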

Specifying a batch size of 1 up to 5 million rows on the bulk insert command helps persist data to disk while the bulk insert is ongoing (instead of saving it all at the end); doing so minimizes memory pressure on the memory pool PoolHk we created.

------------------------------------------------------
-- Bulk insert into Durable Memory-Optimized table  --
------------------------------------------------------
BULK INSERT [LINEITEM_Memory_Optimized_D]
FROM 'F:\TPCH\lineitem_export.csv'
WITH ( FIELDTERMINATOR = '|',
       ROWTERMINATOR = '\n',
       BATCHSIZE = 5000000 );

The data load into the durable in-memory table completes in 5 minutes 28 seconds, or 183K rows/sec. That's an 'okay' time, but not that much faster than our baseline. Looking at sys.dm_os_wait_stats shows that the no. 1 wait type is 'IMPPROV_IOWAIT', which occurs when SQL Server waits for a bulk load I/O to finish. The performance counters 'Bulk Copy Rows/sec' and 'Disk Write Bytes/sec' show flushing-to-disk spikes of 275 MB/sec each time a batch comes in (the green spikes). That is the maximum the disk can deliver, but it doesn't explain everything. Given the minor gain, we will park this one for future investigation.
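The waits can be ranked with a query like the one below. Note that sys.dm_os_wait_stats is cumulative since instance start, so clear it before the test run:

-- Top waits by total wait time; clear first with:
-- DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;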


[Figure: 'Bulk Copy Rows/sec' and 'Disk Write Bytes/sec' performance counters during the durable in-memory load]
Monitoring the Memory Pool

Via the 'sys.dm_resource_governor_resource_pools' DMV we can check whether our in-memory table leverages the newly created 'PoolHk' memory pool:

SELECT pool_id,
       name,
       min_memory_percent,
       max_memory_percent,
       max_memory_kb/1024    AS Max_memory_MB,
       used_memory_kb/1024   AS Used_memory_MB,
       target_memory_kb/1024 AS Target_memory_MB
FROM sys.dm_resource_governor_resource_pools;

The output shows, per pool, the configured percentages and the maximum, used and target memory in MB.
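The database-to-pool binding itself can be verified too; sys.databases exposes a resource_pool_id column for this (available since SQL Server 2014, as far as I know):

-- Shows which resource pool the database is bound to
SELECT name,
       resource_pool_id
FROM sys.databases
WHERE name = N'TicketReservations';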
