By: Sergey Gigoyan
Problem
According to database design best practices, a SQL Server table should not contain duplicate rows. During the database design process, primary keys should be created to eliminate duplicate rows. However, sometimes we need to work with databases where these rules are not followed or where exceptions are possible (when the rules are bypassed knowingly). For example, a staging table may be loaded from different sources in which duplicate rows are possible. When the loading process completes, the table should be cleaned or clean data should be loaded into a permanent table, and the duplicates are no longer needed. This raises the issue of removing duplicates from the loading table. In this tip, let's examine some ways to solve data de-duplication needs.
Solution
We will consider two cases in this tip:
The first case is when a SQL Server table has a primary key (or unique index) and one of the columns contains duplicate values which should be removed.
The second case is when the table does not have a primary key or any unique indexes and contains duplicate rows which should be removed.
Let's discuss these cases separately.
Removing duplicate rows from a SQL Server table with a unique index
Test Environment Setup
To accomplish our tasks we need a test environment:
USE master
GO
CREATE DATABASE TestDB
GO
USE TestDB
GO
CREATE TABLE TableA
(
ID INT NOT NULL IDENTITY(1,1),
Value INT,
CONSTRAINT PK_ID PRIMARY KEY(ID)
)
Now let's insert data into 'TableA':
USE TestDB
GO
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
SELECT *
FROM TableA
SELECT Value, COUNT(*) AS DuplicatesCount
FROM TableA
GROUP BY Value
As we can see, the values 3 and 5 exist in the 'Value' column more than once (3 appears twice and 5 appears three times).
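To narrow the output to the duplicated values only, the grouping query above can be filtered with a HAVING clause. This is a minimal sketch against the 'TableA' test table from this tip, not part of the original solutions:

```sql
USE TestDB
GO
-- List only the values that occur more than once in 'TableA'
SELECT Value, COUNT(*) AS DuplicatesCount
FROM TableA
GROUP BY Value
HAVING COUNT(*) > 1
```

With the sample data inserted above, this should return only the values 3 and 5.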

Identify Duplicate Rows in the Table
Our task is to enforce uniqueness for the 'Value' column by removing duplicates. Removing duplicate values from a table with a unique index is a bit easier than removing rows from a table without one. First of all, we need to find the duplicates. There are many different ways to do that, so let's investigate and compare some common ones. The code below shows six solutions for finding the duplicate values which should be deleted (leaving only one row per value):
-- Finding duplicate values in a table with a unique index
--Solution 1
SELECT a.*
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID < MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NOT NULL
--Solution 2
SELECT a.*
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NULL
--Solution 3
SELECT a.*
FROM
TableA a
INNER JOIN
(
SELECT MAX(ID) AS ID, Value
FROM TableA
GROUP BY Value
HAVING COUNT(Value) > 1
) b
ON a.ID < b.ID AND a.Value=b.Value
--Solution 4
SELECT a.*
FROM TableA a
WHERE ID < (SELECT MAX(ID) FROM TableA b WHERE a.Value=b.Value GROUP BY Value HAVING COUNT(*) > 1)
--Solution 5
SELECT a.*
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b
ON a.ID=b.ID
WHERE b.rnk > 1
--Solution 6
SELECT * FROM TableA
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableA
GROUP BY Value)
As we can see, all six queries return the same rows.

Only the rows with ID = 3, 5 and 6 need to be deleted. Looking at the execution plans, we can see that the last and most 'compact' solution ('Solution 6') has the highest cost, and the second solution has the lowest cost. Note that in our example there is a primary key on the 'ID' column, so 'NULL' values are not possible for that column and 'NOT IN' will work without any problem.
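The NULL caveat matters when the compared column is nullable: if the subquery of a 'NOT IN' returns even one NULL, the predicate never evaluates to true and no rows come back. A correlated EXISTS avoids that problem. The query below is a sketch of such a variant. It is not one of the six solutions above, and its cost was not measured in this tip:

```sql
USE TestDB
GO
-- A row is a duplicate to delete if another row with the same Value
-- and a higher ID exists (so the row with the highest ID is kept)
SELECT a.*
FROM TableA a
WHERE EXISTS (SELECT 1
              FROM TableA b
              WHERE b.Value = a.Value
                AND b.ID > a.ID)
```

One difference to keep in mind: because 'b.Value = a.Value' is never true for NULLs, rows with a NULL 'Value' are not reported as duplicates of each other here, whereas GROUP BY places all NULLs in one group.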

Deleting Duplicate Rows
Now by using these queries, let's delete duplicate values from the table. To simplify our process we will use only the second, the fifth and the sixth queries:
USE TestDB
GO
--Initializing the table
TRUNCATE TABLE TableA
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
--Deleting duplicate values
DELETE t
FROM TableA t
WHERE ID IN ( SELECT a.ID FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NULL)
--Initializing the table
TRUNCATE TABLE TableA
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
--Deleting duplicate values
DELETE a
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b
ON a.ID=b.ID
WHERE b.rnk>1
--Initializing the table
TRUNCATE TABLE TableA
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
--Deleting duplicate values
DELETE FROM TableA
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableA
GROUP BY Value)
After deleting the data and looking at the execution plans again, we see that the fastest is the first DELETE command and the slowest is the last, as expected.
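Since the stated goal is to enforce uniqueness for the 'Value' column, a natural follow-up after any of these DELETE commands is to verify that no duplicates remain and then add a unique constraint. This is a sketch only. The constraint name 'UQ_TableA_Value' is a hypothetical name chosen for illustration:

```sql
USE TestDB
GO
-- Verification: this should return no rows once the duplicates are removed
SELECT Value, COUNT(*) AS DuplicatesCount
FROM TableA
GROUP BY Value
HAVING COUNT(*) > 1
GO
-- Prevent new duplicates in 'Value' (hypothetical constraint name)
ALTER TABLE TableA
ADD CONSTRAINT UQ_TableA_Value UNIQUE (Value)
GO
```

The ALTER TABLE statement fails if duplicates are still present, so it also acts as a final check. Note that a SQL Server unique constraint allows only a single NULL in the column.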

Removing duplicates from a table without a unique index in Oracle
To help illustrate our final example in this tip, I want to explain some similar functionality in Oracle. Removing duplicate rows from a table without a unique index is a little easier in Oracle than in SQL Server. Oracle has a ROWID pseudo column which returns the address of the row. It uniquely identifies the row in the table (and usually in the database as well, with one exception: if different tables store data in the same cluster, they can have the same ROWID). The script below creates a table in the Oracle database and inserts data into it:
CREATE TABLE TableB (Value INT);
INSERT INTO TableB(Value) VALUES(1);
INSERT INTO TableB(Value) VALUES(2);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(4);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(5);
Now we are selecting the data and ROWID from the table:
SELECT ROWID, Value FROM TableB;
The result shows each row's ROWID together with its value.

Now, using ROWID, we can easily remove duplicate rows from the table:
DELETE TableB
WHERE rowid NOT IN (
SELECT MAX(rowid)
FROM TableB
GROUP BY Value
);
We can also remove duplicates using the code below:
DELETE FROM TableB o
WHERE rowid < (
SELECT MAX(rowid)
FROM TableB i
WHERE i.Value=o.Value
GROUP BY Value
);
Removing duplicates from a SQL Server table without a unique index
In SQL Server there is no equivalent to Oracle's ROWID, so to remove duplicates from a table without a unique index we need to do additional work for generating uniq