Fuzzy Transformation and Fuzzy Grouping in SQL Server Integration Services

By: Bhavesh Patel

Problem

As part of the data cleaning process, we can use the Fuzzy Lookup and Fuzzy Grouping data flow transformations. Both can be used to standardize and correct data during the load process. Because the two transformations are similar, developers may confuse them, so I will demonstrate the differences between these two components.

Solution

The Fuzzy Lookup transformation standardizes data by correcting values and supplying missing ones, while the Fuzzy Grouping transformation cleans data by identifying rows that are likely to be duplicates and selecting a canonical row to use in standardizing the data. We will demonstrate both of these transformations.

Setup SQL Server Test Environment

First, we will set up a test database, Fuzzy_lookup, with a master table, CustomerData, and insert some test data.

USE MASTER
GO
CREATE DATABASE Fuzzy_lookup
GO
USE Fuzzy_lookup
GO
CREATE TABLE CustomerData
(
custID INT PRIMARY KEY IDENTITY(1,1),
SAPCODE INT,
customerName NVARCHAR(200),
email varchar(100)
)
GO
-- custID is an IDENTITY column, so it is generated automatically and omitted here
INSERT INTO CustomerData (SAPCODE, customerName, email)
SELECT 101,'Bv patel','bv@abc.com'
UNION ALL
SELECT 102,'su patel','su@gt.com'
UNION ALL
SELECT 103,'test','test@st.com'
UNION ALL
SELECT 104,'kb patel','kb@rt.com'
UNION ALL
SELECT 105,'kkb patel','kkb@kst.com'
UNION ALL
SELECT 106,'bv patela','kkb@kst.com'
GO

Fuzzy Lookup Transformation in SQL Server Integration Services

The Fuzzy Lookup transformation is used for fuzzy matching, that is, finding close rather than exact matches. The regular Lookup transformation uses an equi-join to locate matching records in the reference table, so for each input row it returns either an exact match or nothing. The Fuzzy Lookup transformation instead returns one or more close matches from the reference table. Each column mapping can be configured for exact or fuzzy matching, but the transformation requires at least one column match to be configured for fuzzy matching. If you want to use only exact matching, use the Lookup transformation instead.
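The contrast between exact and fuzzy matching can be sketched in Python. This is only an illustration: it uses the standard library's difflib as a stand-in similarity measure, which is an assumption and not the algorithm SSIS uses. The function names, the cutoff value, and the misspelled input are hypothetical; the reference names come from the CustomerData table above.

```python
import difflib

# Reference names copied from the CustomerData setup script.
REFERENCE = ["Bv patel", "su patel", "test", "kb patel", "kkb patel", "bv patela"]

def exact_lookup(name):
    """Mimic an equi-join: return the reference value only on an exact match."""
    return name if name in REFERENCE else None

def fuzzy_lookup(name, cutoff=0.6):
    """Return the closest reference value, or None if nothing is close enough.

    difflib is a stand-in here; SSIS uses its own matching algorithm.
    """
    matches = difflib.get_close_matches(name, REFERENCE, n=1, cutoff=cutoff)
    return matches[0] if matches else None

incoming = "su patell"  # hypothetical misspelled input value
print("exact:", exact_lookup(incoming))
print("fuzzy:", fuzzy_lookup(incoming))
```

The exact lookup finds nothing for the misspelled value, while the fuzzy lookup still resolves it to the closest reference name.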


There are three features for customizing this lookup.

Maximum number of matches to output per lookup

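This setting caps how many reference rows the transformation returns for each input row, that is, for each lookup. The transformation returns at most that many matches and may return fewer. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup, and some of the rows may be duplicates.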
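The "at most N matches per lookup" behavior can be illustrated with a short Python sketch. It again relies on difflib as an assumed stand-in for the SSIS matching engine, and the `lookup` function and its `cutoff` parameter are hypothetical names used only for this example.

```python
import difflib

# Reference names copied from the CustomerData setup script.
REFERENCE = ["Bv patel", "su patel", "test", "kb patel", "kkb patel", "bv patela"]

def lookup(name, max_matches, cutoff=0.0):
    """Return up to max_matches reference names, best match first.

    The cap is an upper bound: fewer names come back when fewer
    candidates reach the cutoff.
    """
    return difflib.get_close_matches(name, REFERENCE, n=max_matches, cutoff=cutoff)

# One input value can produce several output rows when the cap is above 1.
for cap in (1, 3):
    print(cap, lookup("kb patel", max_matches=cap))

# With a strict cutoff only one candidate qualifies, even though the cap is 3.
print(lookup("test", max_matches=3, cutoff=0.8))
```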


Token delimiters

The transformation provides a default set of delimiters that it uses to tokenize the data. You can add custom token delimiters to suit how your data should be split for matching.
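To make the idea of tokenizing concrete, here is a minimal Python sketch that splits a value into tokens on a set of delimiter characters. The delimiter set shown is an assumption chosen for illustration and is not the component's actual default list; `tokenize` is a hypothetical helper.

```python
import re

# Assumed delimiter set for illustration only (not the SSIS default list).
DELIMITERS = " \t\r\n,.;:-"

def tokenize(value, delimiters=DELIMITERS):
    """Split value into tokens on any run of delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]+"
    # Drop empty strings produced by leading or trailing delimiters.
    return [token for token in re.split(pattern, value) if token]

print(tokenize("bv-patel, kb patel"))
# Adding a custom delimiter changes how the same value is split.
print(tokenize("bv_patel"))
print(tokenize("bv_patel", delimiters=DELIMITERS + "_"))
```

Adding the underscore as a custom delimiter turns a single token into two, which in turn changes what a token-based comparison would see.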


Similarity thresholds

The similarity threshold is a decimal value between 0 and 1. The closer it is to 1, the more closely the input and reference values must resemble each other to qualify as a match. A similarity score of 1 means an exact match between the input value and the reference value. The confidence score, also between 0 and 1, indicates the confidence in the match. If no usable match is found, similarity and confidence scores of 0 are assigned to the row and the output columns copied from the reference table will contain null values.
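The threshold and the "no usable match" rule can be sketched as follows. This is a rough illustration under stated assumptions: difflib's ratio stands in for the SSIS similarity score, the confidence score is not modeled at all, and `best_match` is a hypothetical function. The names and SAPCODE values come from the setup script above.

```python
import difflib

# customerName -> SAPCODE, copied from the CustomerData setup script.
REFERENCE = {
    "Bv patel": 101, "su patel": 102, "test": 103,
    "kb patel": 104, "kkb patel": 105, "bv patela": 106,
}

def best_match(name, threshold):
    """Return (similarity, customerName, SAPCODE) for the best reference row.

    If nothing is similar at all, or the best score is below the threshold,
    the similarity is 0 and the reference columns are None (null).
    """
    scored = [
        (difflib.SequenceMatcher(None, name.lower(), ref.lower()).ratio(), ref)
        for ref in REFERENCE
    ]
    similarity, ref = max(scored)
    if similarity == 0.0 or similarity < threshold:
        return 0.0, None, None
    return similarity, ref, REFERENCE[ref]

print(best_match("test", 0.8))     # exact match scores 1.0
print(best_match("kb patl", 0.8))  # close match above the threshold
print(best_match("zzzz", 0.8))     # no usable match
```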


The transformation adds two columns to its output:

_Similarity, a column that describes the similarity between values in the input and reference columns.
_Confidence, a column that describes the quality of the match.

Fuzzy Lookup Example in SQL Server Integration Services

I have a CSV file, customerData, with two columns: name and customerPoints. We also have the master table CustomerData that we created above. We will match the rows in the CSV file against the CustomerData table.

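The steps below build a package whose data flow reads the CSV file with a Flat File Source, matches the rows with a Fuzzy Lookup, and passes the result on to a Conditional Split.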


Flat File Source

To demonstrate the Fuzzy Lookup transformation with my use case, I first add a data flow task, which I named “GettingCustData”.


Inside the data flow task, I use a flat file source to read the customer file “customerData”.


In the data flow GettingCustData, I have configured this file as the flat file source. Its output has two columns: name and customerPoints.


Next, I add the Fuzzy Lookup transformation and connect it to the Flat File Source.


Fuzzy Lookup: Reference Table

On the Reference Table tab, I selected the sample master table “CustomerData” as the reference table. Its customerName column will be linked to the name column from the flat file.


Fuzzy Lookup: Columns

Here, we can configure the column mapping needed for the fuzzy matching.


Fuzzy Lookup: Advanced

I need at most one output row per lookup, so I set Maximum number of matches to output per lookup to 1. I kept the similarity threshold at 0 so that every row's similarity score remains visible in the Fuzzy Lookup output. I also kept the default token delimiters.


Add a Data Viewer to See Results

After configuring the Fuzzy Lookup, I added a Data Viewer on the path between the Fuzzy Lookup and the Conditional Split so we can see what the data looks like.


I executed the package and inspected the rows in the data viewer.

The output includes the _Similarity and _Confidence columns. We can also see that there is no match for "test". To handle these different outcomes, I will add a Conditional Split.
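The routing idea behind a conditional split can be sketched in Python. Everything specific here is an assumption: the 0.5 cutoff, the `split_rows` helper, and the sample rows with their scores are invented for illustration and are not taken from the package.

```python
def split_rows(rows, min_similarity=0.5):
    """Route each row to 'matched' or 'unmatched' based on its _Similarity score."""
    matched, unmatched = [], []
    for row in rows:
        if row["_Similarity"] >= min_similarity:
            matched.append(row)
        else:
            unmatched.append(row)
    return matched, unmatched

# Hypothetical rows; the names and scores are invented for illustration.
sample = [
    {"name": "kb patl", "_Similarity": 0.9, "_Confidence": 0.8},
    {"name": "unknown", "_Similarity": 0.0, "_Confidence": 0.0},
]

matched, unmatched = split_rows(sample)
print("matched:", [row["name"] for row in matched])
print("unmatched:", [row["name"] for row in unmatched])
```

Rows that found no usable match carry a similarity of 0, so they fall into the unmatched output and can be handled separately.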

Conditional
