Quantcast
Channel: CodeSection,代码区,SQL Server(mssql)数据库 技术分享 - CodeSec
Viewing all articles
Browse latest Browse all 3160

Data Sampling in SQL Server Integration Services

$
0
0

By:Nat Sundar || Related Tips:More > Integration Services Development

Problem

When developingSSIS packages, sometimes it is easier to work with a sample dataset instead of the entire dataset. In this tip we look at a couple of different ways to do data sampling transformations in SSIS.

Solution

There are two transformations in SSIS that will help us generate a smaller subset of data based on sampling. They are row sampling and percentage sampling transformations.

Usage of SSIS sampling transformation Instead of using the full production dataset, a sample can be used to test the package multiple times for different conditions. This can be very helpful to replicate production bugs on a development server. Based on the environment capacity (Dev, Test & UAT servers), the test data size can be adjusted easily. Allows you to test data mining models with different test data. Allows for data profiling activities to understand the quality of the source data. SSIS Row Sampling Transformation

The SSIS Row Sampling transformation generates a sample dataset based on a random selection from the input dataset. The expected number of sample records must be entered in the transformation. By default the transformation generates a random number to select the input rows. So every time you run the transformation you may a get different set of records. This is very useful, as you may want to test the data flow for different sets of representative data.

However if you prefer to have the same set of data every time, then you can define a seed. This can be done by enabling the option "Use the following random seed".

This transformation generates two outputs. The "sampling selected output" will provide the sample dataset based on the number of rows keyed in and the rest of the records will be available thru the output "Sampling unselected output".

The sampling seed and the sampling value properties can also be set using the properties window.

In the below image, we are extracting data from Fact InternetSales and would like to sample 100 records. The row sampling transformation has been added as below.


Data Sampling in SQL Server Integration Services

The sampling value (number of rows) has been set to 100. Also the sampling seed has been set to 1, so that we get the same records each time we this is run.


Data Sampling in SQL Server Integration Services

After successful execution, we can see the sampling records thru the "Sampling Selected Output".


Data Sampling in SQL Server Integration Services

The data viewer shows the sample output.


Data Sampling in SQL Server Integration Services
SSIS Percentage Sampling

The SSIS Percentage Sampling transformation works the same way as the Row Sampling transformation except we provide the percentage of sample records instead of a specific record count.

The Percentage Sampling transformation uses an algorithm to determine whether a row should be included in the sample output. Because of the functionality of the algorithm, the number of output rows will not exactly match the specified percentage. Let's assume we have a source dataset with 10,000 records and you want to inspect the data quality by sampling 10% of the dataset. Ideally you expect to see 1000 records, but the actual records sampled by SSIS may be greater or less than 1000 records.

In the below image, we are extracting data from Fact InternetSales and would like to sample 10% of the records. The percentage sampling transformation has been added as below.


Data Sampling in SQL Server Integration Services

The sampling value (percentage of rows) has been set to 1.


Data Sampling in SQL Server Integration Services

After successful execution, we can see the sampling records thru the "Sampling Selected Output".


Data Sampling in SQL Server Integration Services
Summary

The Sampling transformations has a wide variety of usage. In data mining analysis, both the selected and unselected outputs can be used to design and test the model. Both the row and percentage sampling transformations are easy to learn and configure.

Next Steps Read more about Row Sampling Read more about Percentage Sampling Last Update: 12/20/2016

Viewing all articles
Browse latest Browse all 3160

Trending Articles