
Optimize Large SQL Server DML Processes by Using Batches

By: Eduardo Pivaral | Last Updated: 2018-11-13 | Related Tips: More > T-SQL

Problem

Sometimes you must perform DML processes (insert, update, delete, or combinations of these) on large tables. If your SQL Server database has high concurrency, these processes can lead to blocking or fill up the transaction log, even if you run them outside of business hours. So maybe you were asked to optimize some processes to avoid large log growth and minimize locks on tables, and by optimizing these processes some execution time can be saved as well. How can this be achieved?

In a previous tip we discussed how to implement a batch solution for tables with sequential numeric keys, but what about tables without sequential numeric keys? In this tip we will show another approach for these types of tables.

Solution

We will perform these DML processes in batches with the help of the TOP clause. NOTE: We could also use SET ROWCOUNT, but its effect on DML statements is deprecated and will be removed in a future release of SQL Server, so I do not recommend using it. Working in batches also gives you the ability to implement custom "stop-resume" logic. We will show you a general method, so you can take it as a base and adapt it to your own processes.

For this method a sequential numeric key is not required, but we still need a key, so if you have to perform an operation over a large heap, you will first need to make sure the table has one.
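For example, if you have to process a large heap with no usable key, one option is to add an identity column and index it. This is only a sketch against a hypothetical table [dbo].[my_heap], not part of this tip's test setup:

-- Hypothetical heap [dbo].[my_heap]: add a surrogate key so the batch loop
-- has something to compare against with the greater-than filter.
ALTER TABLE [dbo].[my_heap]
ADD [batch_key] BIGINT IDENTITY(1,1) NOT NULL;

-- Index the new column so WHERE [batch_key] > @id_control does not scan the whole heap.
CREATE UNIQUE NONCLUSTERED INDEX IX_my_heap_batch_key ON [dbo].[my_heap] ([batch_key]);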

Basic Algorithm

The basic batch process is something like this:

DECLARE @id_control INT
DECLARE @batchSize INT
DECLARE @results INT

SET @results = 1       -- stores the row count after each successful batch
SET @batchSize = 10000 -- how many rows you want to operate on in each batch
SET @id_control = -1   -- a value lower than your lowest key value

-- when 0 rows are returned, exit the loop
WHILE (@results > 0)
BEGIN
    -- put your custom code here
    UPDATE TOP(@batchSize) <YOUR_TABLE> -- or any other DML command
    SET <your operations>
      , @id_control = [<your key column>] -- take the latest key value updated to drive the next iteration
    WHERE [<your key column>] > @id_control -- filter out rows already processed
      AND <additional conditions you may need>
    -- end of your custom operations

    -- very important: obtain the latest row count to avoid infinite loops
    SET @results = @@ROWCOUNT
END

Explanation of Code

We use the TOP clause to limit the number of rows processed in each batch.

We also include a control variable to filter out already-processed rows in the next iteration, and we keep looping until the row count reaches 0. If you are performing a DELETE, you can remove that filter, since there is no need to track deleted rows, which makes the loop even simpler (see the sketch below).

So, with this approach we don’t need sequential numeric keys, just any key to uniquely identify each row (to avoid processing more rows than we really need).
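For reference, here is a simplified sketch of the same loop for a DELETE; this variant is implied by the explanation above but not shown in the tip, and the table name and condition are placeholders:

DECLARE @batchSize INT
DECLARE @results INT

SET @results = 1        -- stores the row count after each successful batch
SET @batchSize = 10000  -- how many rows to delete per batch

WHILE (@results > 0)
BEGIN
    -- no key filter needed: deleted rows cannot be picked up again
    DELETE TOP(@batchSize) FROM <YOUR_TABLE>
    WHERE <your delete condition>

    SET @results = @@ROWCOUNT
END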

Important Note: You might notice the added complexity of this approach, which works around the drawbacks of the previous tip's much simpler method, so you must be extra careful and test it thoroughly before moving it into production environments.

Sample Table and Data Set

We will use a test table [test_items] with this definition:

CREATE TABLE [dbo].[test_items](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[date] [smalldatetime] NOT NULL,
[quantity] [int] NOT NULL,
[value] [money] NOT NULL,
[description] [nvarchar](500) NOT NULL,
[enabled] [bit] NOT NULL,
[category1] [tinyint] NOT NULL,
[category2] [tinyint] NOT NULL,
CONSTRAINT [PK_test_items] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

I then inserted 9,126,650 random records.
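The data load script is not included in this tip; a rough sketch of one way to generate random rows (run it repeatedly until you reach the desired volume) could look like this:

-- Rough sketch only: insert 100,000 random rows per execution into test_items.
INSERT INTO [dbo].[test_items] ([date], [quantity], [value], [description], [enabled], [category1], [category2])
SELECT TOP (100000)
    DATEADD(DAY, -(ABS(CHECKSUM(NEWID())) % 365), GETDATE()), -- random date within the last year
    ABS(CHECKSUM(NEWID())) % 1000,                            -- random quantity
    CAST(ABS(CHECKSUM(NEWID())) % 100000 AS money) / 100,     -- random value
    CONVERT(nvarchar(500), NEWID()),                          -- random text
    ABS(CHECKSUM(NEWID())) % 2,                               -- random enabled flag
    ABS(CHECKSUM(NEWID())) % 10,                              -- category1 between 0 and 9
    ABS(CHECKSUM(NEWID())) % 10                               -- category2 between 0 and 9
FROM sys.all_objects a CROSS JOIN sys.all_objects b;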

UPDATE Statement in Single Batch

We will take this base UPDATE inside a transaction as our starting point; the description field it updates was loaded with random data (so our test is more realistic), and after clearing the cache, we proceed to execute it.
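For reference, "clearing the cache" on a test server usually means something like the following commands; do not run them on a production instance, since they flush the buffer pool and plan cache server-wide:

-- Lab/test only: write dirty pages to disk, then clear the buffer pool and plan cache
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;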

Here is a screenshot of the transaction log before the operation.


[Screenshot: transaction log size before the operation]
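If you want to take the same measurements on your own system, a query like the following returns the current log size and usage for the database you are connected to (DBCC SQLPERF(LOGSPACE) is an older alternative):

-- Current transaction log size and usage for the current database (SQL Server 2012+)
SELECT total_log_size_in_bytes / 1048576.0 AS log_size_mb,
       used_log_space_in_bytes / 1048576.0 AS used_log_mb,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;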

BEGIN TRAN;
UPDATE test_items
SET [description] = 'PROCESSED IN 1 BATCHES!'
WHERE category1 IN (2,7,9);
COMMIT TRAN;

The execution took 55 seconds on my lab machine.


[Screenshot: execution time of the single-batch UPDATE]

There were 2,738,899 rows updated.

And the log usage? Checking the log size again, we can see it grew by 1.69 GB (and the space was then released, since the database uses the SIMPLE recovery model):


[Screenshot: transaction log growth after the single-batch UPDATE]
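If you want to confirm the recovery model of your own test database before comparing log growth numbers, a quick check is:

-- The test database in this tip uses the SIMPLE recovery model
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = DB_NAME();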

T-SQL Update Using Smaller Batches

Now let's execute the same UPDATE statement in batches. We copied the original data back from a backup, released the log space and performed another cache cleanup; this time the UPDATE process is performed in batches, as follows:

SET NOCOUNT ON;

DECLARE @id_control INT
DECLARE @batchSize INT
DECLARE @results INT

SET @results = 1       -- stores the row count after each successful batch
SET @batchSize = 20000 -- how many rows you want to operate on in each batch
SET @id_control = -1   -- a value lower than the lowest key value in the table

-- when 0 rows are returned, exit the loop
WHILE (@results > 0)
BEGIN
    -- put your custom code here
    UPDATE TOP(@batchSize) test_items -- or any other DML command
    SET [description] = 'PROCESSED IN N+1 BATCHES!',
        @id_control = id            -- take the latest id updated to drive the next iteration
    WHERE category1 IN (2,7,9)
      AND id > @id_control          -- filter out rows already processed

    SET @results = @@ROWCOUNT
END

SET NOCOUNT OFF;

This time the query execution took 41 seconds, so there was a slight improvement in execution time:


[Screenshot: execution time of the batched UPDATE]

And what about the log size? As we can see, there was an improvement: this time the log growth was less than 0.25 GB.


[Screenshot: transaction log growth after the batched UPDATE]

There is one last thing to verify: the number of rows affected must be the same, so let's check it:


[Screenshot: affected row count comparison]

And yes, the same 2,738,899 rows were affected.
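A simple way to make that comparison is to count the rows matching the update filter and compare it with the row counts reported by both runs, for example:

-- Rows that the UPDATE is expected to touch; compare with the rows reported as affected
SELECT COUNT(*) AS expected_rows
FROM test_items
WHERE category1 IN (2,7,9);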

Conclusion

As you can see, for very large DML processes, treating the data in batches could be beneficial for execution time, transaction log usage or both.

Even though the execution time is reduced, the improvement is not as large as with the previous method. Another drawback of this method is that you still need a key with a datatype that supports the greater-than operator for comparison.

As stated above, if you plan to use this method for DELETE operations, you can get rid of the id control part of the code.
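Similarly, the custom "stop-resume" logic mentioned in the Solution section is not shown in this tip; one possible sketch is to persist @id_control in a small control table after every batch, so an interrupted run can continue where it left off (the control table and process name below are hypothetical):

-- Hypothetical stop-resume variant, not part of the original tip.
IF OBJECT_ID('dbo.batch_progress') IS NULL
    CREATE TABLE dbo.batch_progress (process_name sysname PRIMARY KEY, last_id BIGINT NOT NULL);

DECLARE @id_control BIGINT, @batchSize INT = 20000, @results INT = 1;

-- Resume from the saved position, or start below the lowest key value
SELECT @id_control = ISNULL((SELECT last_id FROM dbo.batch_progress
                             WHERE process_name = 'test_items_update'), -1);

WHILE (@results > 0)
BEGIN
    UPDATE TOP(@batchSize) test_items
    SET [description] = 'PROCESSED IN N+1 BATCHES!',
        @id_control = id
    WHERE category1 IN (2,7,9)
      AND id > @id_control

    SET @results = @@ROWCOUNT

    -- Persist progress so the loop can be stopped and resumed later
    IF @results > 0
    BEGIN
        UPDATE dbo.batch_progress SET last_id = @id_control
        WHERE process_name = 'test_items_update'
        IF @@ROWCOUNT = 0
            INSERT INTO dbo.batch_progress (process_name, last_id)
            VALUES ('test_items_update', @id_control)
    END
END

With this pattern you can stop the script at any point and re-run it later; the only extra cost per batch is one small write to the control table.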

Next Steps

Check out part 1 of this tip, where we explained a simpler method that has more requirements to meet.
You can determine whether your process can use this batch method by just running the SELECT and comparing the number of expected rows with the results.
You can increase or decrease the batch size to suit your needs.
