By: Bhavesh Patel
Problem
As part of the data cleaning process, we can use the Fuzzy Lookup and Fuzzy Grouping data flow transformations. Both can be used to standardize and correct data during the load process. Because the two transformations are similar, a developer may confuse them, so I will demonstrate the differences between these two components.
Solution
The Fuzzy Lookup transformation standardizes data by correcting values and providing missing ones. The Fuzzy Grouping transformation, in contrast, cleans data by identifying rows that are likely to be duplicates and selecting a canonical row to use in standardizing the data. We will demonstrate both of these transformations.
Setup SQL Server Test Environment
First, we will set up a test database, Fuzzy_lookup, with a master table, CustomerData, and insert some test data.
USE master
GO
CREATE DATABASE Fuzzy_lookup
GO
USE Fuzzy_lookup
GO
CREATE TABLE CustomerData
(
custID INT PRIMARY KEY IDENTITY(1,1),
SAPCODE INT,
customerName NVARCHAR(200),
email varchar(100)
)
GO
INSERT INTO CustomerData (SAPCODE, customerName, email)
SELECT 101,'Bv patel','bv@abc.com'
UNION ALL
SELECT 102,'su patel','su@gt.com'
UNION ALL
SELECT 103,'test','test@st.com'
UNION ALL
SELECT 104,'kb patel','kb@rt.com'
UNION ALL
SELECT 105,'kkb patel','kkb@kst.com'
UNION ALL
SELECT 106,'bv patela','kkb@kst.com'
GO

Fuzzy Lookup Transformation in SQL Server Integration Services
The Fuzzy Lookup transformation is used for fuzzy matching (close but not exact matching). The regular Lookup transformation uses an equi-join to locate matching records in the reference table, so each input record either has at least one exact match or has no match at all. The Fuzzy Lookup transformation instead returns one or more close matches from the reference table. It requires at least one column match to be configured for fuzzy matching. If you want to use only exact matching, use the Lookup transformation instead.
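To make the difference between exact and fuzzy matching concrete, here is a minimal Python sketch. It is only an illustration: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, which is not the algorithm SSIS uses, and the input value `kb patell` is a hypothetical misspelling, not a row from the CSV file.

```python
from difflib import SequenceMatcher

# Customer names from the CustomerData test table created above.
reference = ["Bv patel", "su patel", "test", "kb patel", "kkb patel", "bv patela"]


def exact_lookup(value, rows):
    """Equi-join style lookup: only identical values match."""
    return [r for r in rows if r == value]


def fuzzy_lookup(value, rows):
    """Return the closest reference value and its similarity ratio (0 to 1)."""
    scored = [
        (SequenceMatcher(None, value.lower(), r.lower()).ratio(), r) for r in rows
    ]
    score, best = max(scored)
    return best, score


# Hypothetical misspelled input value.
print(exact_lookup("kb patell", reference))   # no exact match
print(fuzzy_lookup("kb patell", reference))   # closest reference value and score
```

The exact lookup finds nothing for the misspelled name, while the fuzzy version still returns the nearest reference value together with a score that can be inspected later.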

There are three features for customizing this lookup.
Maximum number of matches to output per lookup
This sets an upper limit on how many reference rows are returned for each input row. The transformation returns at most that number of matches, not necessarily that many. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.
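The "at most N matches" behavior can be sketched in Python with `difflib.get_close_matches`, whose `n` and `cutoff` parameters play roles similar to the maximum number of matches and a similarity threshold. This is an analogy under that assumption, not the SSIS implementation, and the `lookup` helper is a hypothetical name.

```python
from difflib import get_close_matches

# Customer names from the CustomerData test table created above.
reference = ["Bv patel", "su patel", "test", "kb patel", "kkb patel", "bv patela"]


def lookup(value, rows, max_matches=1, cutoff=0.0):
    """Return up to max_matches reference values, best match first.

    Fewer values come back when not enough rows reach the cutoff.
    """
    return get_close_matches(value, rows, n=max_matches, cutoff=cutoff)


print(lookup("kb patel", reference, max_matches=1))
print(lookup("kb patel", reference, max_matches=3))
# A stricter cutoff can return fewer rows than max_matches allows.
print(lookup("kb patel", reference, max_matches=3, cutoff=0.9))
```

With a limit of 1 each input row produces a single output row, which is why the example later in this tip uses 1.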

Token delimiters
The transformation uses delimiters to split (tokenize) the data into tokens before matching. A default set of delimiters is provided, and you can add custom token delimiters if your data requires them.
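As a rough illustration of what tokenizing with delimiters means, the following Python sketch splits a value on a configurable delimiter set. The delimiter string is an assumption modeled on whitespace plus common punctuation, so check the Advanced tab of the Fuzzy Lookup editor for the actual default. The `tokenize` helper is hypothetical.

```python
import re

# Assumed delimiter set for illustration: whitespace plus common punctuation.
DEFAULT_DELIMITERS = " \t\r\n,.;:-\"'&/\\@!?()<>[]{}|#*^%"


def tokenize(text, delimiters=DEFAULT_DELIMITERS):
    """Split text into tokens on any run of delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [token for token in re.split(pattern, text) if token]


print(tokenize("kb patel"))      # split on the space
print(tokenize("patel,kb"))      # split on the comma
print(tokenize("kb_patel"))      # underscore is not a delimiter here
# Adding a custom delimiter changes how the same value is tokenized.
print(tokenize("kb_patel", DEFAULT_DELIMITERS + "_"))
```

Adding the underscore as a custom delimiter turns `kb_patel` into two tokens, which is the kind of adjustment the custom delimiter setting is meant for.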

Similarity thresholds
The similarity threshold is a decimal value between 0 and 1. A value of 1 means the input value must match the reference value exactly, and the closer the threshold is to 1, the closer a reference value must be to qualify as a match. Each match also gets a confidence score between 0 and 1 that indicates the confidence in the match. If no usable match is found, similarity and confidence scores of 0 are assigned to the row and the output columns copied from the reference table will contain null values.
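The effect of the threshold on the output row can be sketched as follows. This Python example is an assumption-laden illustration: `difflib.SequenceMatcher` stands in for the SSIS similarity measure, the confidence score is not modeled, and the input names are hypothetical.

```python
from difflib import SequenceMatcher

# (SAPCODE, customerName) pairs from the CustomerData test table created above.
reference = [
    (101, "Bv patel"), (102, "su patel"), (103, "test"),
    (104, "kb patel"), (105, "kkb patel"), (106, "bv patela"),
]


def fuzzy_lookup(name, rows, threshold=0.0):
    """Return the best match, or nulls when nothing reaches the threshold."""
    best_code, best_score = None, 0.0
    for code, ref_name in rows:
        score = SequenceMatcher(None, name.lower(), ref_name.lower()).ratio()
        if score > best_score:
            best_code, best_score = code, score
    if best_code is None or best_score < threshold:
        # No usable match: score 0 and a null reference column.
        return {"SAPCODE": None, "_Similarity": 0.0}
    return {"SAPCODE": best_code, "_Similarity": best_score}


print(fuzzy_lookup("kb patell", reference, threshold=0.8))   # close enough
print(fuzzy_lookup("kb patell", reference, threshold=0.99))  # too strict
print(fuzzy_lookup("zzz", reference, threshold=0.8))         # nothing similar
```

A threshold of 0 lets every best match through, which is useful while exploring the data, and raising it trades recall for stricter matches.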

It adds two columns to the output:
_Similarity, a column that describes the similarity between values in the input and reference columns.
_Confidence, a column that describes the quality of the match.

Fuzzy Lookup Example in SQL Server Integration Services
I have a CSV file, customerData, shown below. It has two columns, name and customerPoints. We also have the master table CustomerData that we created above. We will match the rows of the CSV file against the CustomerData table.
This is what my finished package will look like:

Flat File Source
As mentioned, I am trying to demonstrate the Fuzzy Lookup transformation with respect to my use case. First, I will use a data flow task to perform this action which I named “GettingCustData”.

As part of the data flow task, I will use a flat file data source in order to get the customer file “customerData”.

Now in the data flow GettingCustData I have configured the above file as a flat file source. As an output of this, I have two columns name and customerPoints.

Now, I am going to add the Fuzzy Lookup transformation and link it to the Flat File Source.

Fuzzy Lookup: Reference Table
For the reference table, I used the sample master table “CustomerData”, whose name column is linked with the name column of the flat file.

Fuzzy Lookup: Columns
Here, we can configure the column mapping needed for the fuzzy matching.

Fuzzy Lookup: Advanced
I need at most one match per lookup, so I set Maximum number of matches to output per lookup to 1. I kept the similarity threshold at 0 so that every row shows its best match and we can observe the similarity scores in the output. I also kept the default token delimiters.

Add a Data Viewer to See Results
After applying the fuzzy transformation, I added a Data Viewer between the Fuzzy Lookup and Conditional Split so we can see what the data looks like. Below is the configuration for this.

I executed the package and the output for the data viewer is shown below.

We can see the _Similarity and _Confidence output columns. We can also see that there is no match for "test". To handle these different match results, I will add a Conditional Split as shown below.
Conditional