Find and Remove Duplicate Rows from a SQL Server Table

By: Sergey Gigoyan

Problem

According to database design best practices, a SQL Server table should not contain duplicate rows. During the database design process, primary keys should be created to eliminate duplicate rows. However, sometimes we need to work with databases where these rules are not followed, or where exceptions are made knowingly. For example, a staging table may be loaded from different sources in which duplicate rows are possible. When the loading process completes, the table should be cleaned, or clean data should be loaded into a permanent table, after which the duplicates are no longer needed. This raises the question of how to remove duplicates from the loading table. In this tip, let's examine some ways to solve data de-duplication needs.

Solution

We will consider two cases in this tip:

The first case is when a SQL Server table has a primary key (or unique index) and one of the columns contains duplicate values which should be removed.

The second case is when the table does not have a primary key or any unique indexes and contains duplicate rows which should be removed.

Let's discuss these cases separately.

Removing duplicate rows from a SQL Server table with a unique index

Test Environment Setup

To accomplish our tasks we need a test environment:

USE master
GO
CREATE DATABASE TestDB
GO
USE TestDB
GO
CREATE TABLE TableA
(
ID INT NOT NULL IDENTITY(1,1),
Value INT,
CONSTRAINT PK_ID PRIMARY KEY(ID)
)

Now let's insert data into 'TableA':

USE TestDB
GO
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
SELECT *
FROM TableA
SELECT Value, COUNT(*) AS DuplicatesCount
FROM TableA
GROUP BY Value

As the grouped result shows, the values 3 and 5 exist in the 'Value' column more than once: 3 appears twice and 5 appears three times.
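To list only the values that occur more than once, the grouped query above can be narrowed with a HAVING clause. The following is a minimal sketch against the 'TableA' test table; the column alias 'Occurrences' is just an illustrative name and is not part of the original tip.

```sql
USE TestDB
GO
--Return only the values that appear more than once, with their counts
SELECT Value, COUNT(*) AS Occurrences
FROM TableA
GROUP BY Value
HAVING COUNT(*) > 1
```

With the sample data inserted above, this should return only the values 3 and 5.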

Identify Duplicate Rows in the Table

Our task is to enforce uniqueness for the 'Value' column by removing duplicates. Removing duplicate values from a table with a unique index is a bit easier than removing rows from a table without one. First of all, we need to find the duplicates. There are many different ways to do that, so let's investigate and compare some common ones. The code below contains six solutions that find the duplicate rows which should be deleted (leaving only one row for each value):

----- Finding duplicate values in a table with a unique index
--Solution 1
SELECT a.*
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID < MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NOT NULL
--Solution 2
SELECT a.*
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NULL
--Solution 3
SELECT a.*
FROM
TableA a
INNER JOIN
(
SELECT MAX(ID) AS ID, Value
FROM TableA
GROUP BY Value
HAVING COUNT(Value) > 1
) b
ON a.ID < b.ID AND a.Value=b.Value
--Solution 4
SELECT a.*
FROM TableA a
WHERE ID < (SELECT MAX(ID) FROM TableA b WHERE a.Value=b.Value GROUP BY Value HAVING COUNT(*) > 1)
--Solution 5
SELECT a.*
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b
ON a.ID=b.ID
WHERE b.rnk > 1
--Solution 6
SELECT * FROM TableA
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableA
GROUP BY Value)

All six queries return the same result: the rows with ID 3, 5 and 6, which are the only rows that need to be deleted.

Looking at the execution plans, we can see that the last and most compact solution ('Solution 6') has the highest cost, and the second solution has the lowest cost. Note that 'NOT IN' works without any problem in 'Solution 6' because there is a primary key on the 'ID' column, so 'NULL' values are not possible in that column.
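If the identifying column could contain 'NULL' values, a 'NOT IN' subquery that returns a 'NULL' would make the outer query return no rows at all. One alternative, shown here as a sketch and not as one of the six solutions from this tip, is a correlated EXISTS that looks for a row with the same value and a higher ID. No claim is made here about how its cost compares with the solutions above.

```sql
USE TestDB
GO
--A row is a duplicate to delete when another row
--with the same Value and a higher ID exists
SELECT a.*
FROM TableA a
WHERE EXISTS (SELECT 1
              FROM TableA b
              WHERE b.Value = a.Value
              AND b.ID > a.ID)
```

One difference to keep in mind: 'GROUP BY' places all 'NULL' values of the 'Value' column in one group, while the equality comparison in this sketch never matches 'NULL' to 'NULL', so rows with a 'NULL' value are not reported as duplicates here.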

Deleting Duplicate Rows

Now by using these queries, let's delete duplicate values from the table. To simplify our process we will use only the second, the fifth and the sixth queries:

USE TestDB
GO
--Initializing the table
TRUNCATE TABLE TableA
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
--Deleting duplicate values
DELETE t
FROM TableA t
WHERE ID IN ( SELECT a.ID FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NULL)
--Initializing the table
TRUNCATE TABLE TableA
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
--Deleting duplicate values
DELETE a
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b
ON a.ID=b.ID
WHERE b.rnk>1
--Initializing the table
TRUNCATE TABLE TableA
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
--Deleting duplicate values
DELETE FROM TableA
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableA
GROUP BY Value)

After deleting the data and looking at the execution plans again, we see that the first DELETE command is the fastest and the last is the slowest, as expected.
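Once the duplicates are removed, the uniqueness of the 'Value' column can be enforced so that duplicates cannot return. The sketch below assumes one of the DELETE commands above has already been run; the constraint name 'UQ_TableA_Value' is a hypothetical name chosen for illustration.

```sql
USE TestDB
GO
--Fails if any duplicate values are still present in the column
ALTER TABLE TableA
ADD CONSTRAINT UQ_TableA_Value UNIQUE (Value)
GO
--This insert is now rejected because the value 5 already exists
INSERT INTO TableA(Value)
VALUES(5)
```

Keep in mind that a unique constraint in SQL Server treats 'NULL' as a value, so the column can then hold at most one 'NULL'.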

Removing duplicates from a table without a unique index in Oracle

To help illustrate the final example in this tip, I want to explain some similar functionality in Oracle. Removing duplicate rows from a table without a unique index is a little easier in Oracle than in SQL Server. Oracle has a ROWID pseudocolumn which returns the address of the row. It uniquely identifies the row in the table (and usually in the database as well, with one exception: if different tables store data in the same cluster, their rows can have the same ROWID). The script below creates a table in the Oracle database and inserts data into it:

CREATE TABLE TableB (Value INT);
INSERT INTO TableB(Value) VALUES(1);
INSERT INTO TableB(Value) VALUES(2);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(4);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(5);

Now we are selecting the data and ROWID from the table:

SELECT ROWID, Value FROM TableB;

The result shows a distinct ROWID for each of the eight rows, including the rows with duplicate values.

Now, using ROWID, we can easily remove the duplicate rows from the table:

DELETE TableB
WHERE rowid not in (
SELECT MAX(rowid)
FROM TableB
GROUP BY Value
);

We can also remove duplicates using the code below:

DELETE from TableB o
WHERE rowid < (
SELECT MAX(rowid)
FROM TableB i
WHERE i.Value=o.Value
GROUP BY Value
);

Removing duplicates from a SQL Server table without unique index

In SQL Server there is no equivalent to Oracle's ROWID, so to remove duplicates from the table without a unique index we need to do additional work for generating uniq
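One common way to do this additional work in SQL Server, offered here as a sketch and not necessarily the approach this tip goes on to describe, is to number the rows inside each group of identical values with ROW_NUMBER() in a common table expression and delete through it. The SQL Server table 'TableB' below is an assumption that mirrors the Oracle example, and 'Numbered' and 'rn' are illustrative names.

```sql
USE TestDB
GO
--Hypothetical test table without a primary key or unique index
CREATE TABLE TableB (Value INT);
INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5);
--Numbering the rows inside each group of identical values;
--ORDER BY (SELECT NULL) is used because the rows in a group are indistinguishable
WITH Numbered AS
(
SELECT Value, ROW_NUMBER() OVER(PARTITION BY Value ORDER BY (SELECT NULL)) AS rn
FROM TableB
)
--Deleting through the CTE removes the underlying rows from TableB
DELETE FROM Numbered
WHERE rn > 1;
SELECT Value
FROM TableB;
```

For a table with several columns, the PARTITION BY list would need to include every column that defines a duplicate row.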
