SQL Articles

SQL query to delete duplicate rows


Create a table EmpDtl1 with some duplicate rows as shown below to understand different methods of delete duplicate rows.


    create table EmpDup(empid int,name varchar(20))

    insert into EmpDup values(1,'Andy')
    insert into EmpDup values (1,'Andy')
    insert into EmpDup values (2,'Chris')
    insert into EmpDup values(3,'Bill')
    insert into EmpDup values(3,'Bill')
    insert into EmpDup values (3,'Bill')       
    
Employee table with duplicate rows

User Name "Andy" repeated 2 times and User Name "Bill" repeated 3 times. See above picture.

Following are the 3 different methods for deleting duplicate rows.

Method 1:

Insert the distinct rows from the duplicate rows table to new temporary table. Then delete all the data from duplicate rows  table then insert all data from temporary table which has no duplicates as shown below.

       select distinct * into #tmp From EmpDup
       delete from EmpDup
       insert into EmpDup                
       select * from #tmp drop table #tmp
	
       select * from EmpDup
    

Output:

After delete duplicate rows

Method 2:

If you want to consider only few columns in a table for duplication criteria to delete rows then Method1 doesn't work. In our example assume if EMDup has one more column "hobbies" extra apart from empid , name but you want to delete duplicate records if empid and name are repeated irrespective of "hobbies" data column, in this case Method1 will not work and follow "Method2".

Delete duplicate rows using Common Table Expression(CTE)

   With CTE_Duplicates as
   (select empid,name , row_number() over(partition by empid,name order by empid,name ) rownumber 
   from EmpDup  )
   delete from CTE_Duplicates where rownumber!=1
   

In above query, the function row_number generates a row number in each row for the same empid,name group of result set. See screenshot of CTE table CTE_Duplicates. If row number is greater than 1 means it is a duplicate row and to be deleted. The query "delete from CTE_Duplicates where rownumber!=1" will delete the duplicate rows from table EmpDup.

Delete duplicate rows using CTE

Understanding CTE query is bit difficult so if you still don't understand how CTE is working, no worries see Method3 same job but different approach.

To understand how CTE works in depth , see my article on CTE here

Method 3:

There are two methods here to delete duplicates they are using "group by" and "Rank()"

Using "group by and min()":

Add an identity column to the duplicate table as a serial number that acts as a row unique identifier(auto incremental ascending order). Left join EmpDup table with aliasing table "T" which has lowest "sno" from each duplicates sets. From the result of left join, if T.sno is null means it is a duplicate and to be deleted.

See below example with query and screenshot.

Note: If you have ID column in your duplicate table then no need to add serial number column then delete later as shown in our example.

	    alter table EmpDup add  sno int identity(1,1)
	
	    delete E from EmpDup E
	    left join
	    (select min(sno) sno From EmpDup group by empid,name ) T on E.sno=T.sno
	    where T.sno is null

	    alter table EmpDup 
	    drop  column sno		   
    
Delete duplicates using group by and min()

Using "Rank()":

Add an identity column to the table as a serial number that acts as a row unique identifier(auto incremental ascending order).Then get the Rank against each empid,name based on serial number. If Rank is greater than 1 means it is a duplicate row and to be deleted. After deleting the duplicate rows, remove the identity column which is used for rank. See the below example.

Note: If you have ID column in your duplicate table then no need to add serial number column then delete later as shown in our example.

    alter table EmpDup add  sno int identity(1,1)

    delete E
    from  EmpDup E
    inner join
    (select *,
    RANK() OVER ( PARTITION BY empid,name ORDER BY sno DESC )rank
    From EmpDup )T on E.sno=t.sno
    where T.Rank>1

    alter table EmpDup 
    drop  column sno
    
Delete duplicates using Rank

Click here to know how RANK() works.

If you feel this query is little difficult to understand then use following query, same logic but in different way! that is instead of using "INNER JOIN" to delete duplicates(above query) use "IN" statement in query.

    alter table EmpDup add  sno int identity(1,1)
    delete from EmpDup where sno in
    (
    select sno from (
    select *,
    RANK() OVER ( PARTITION BY empid,name ORDER BY sno DESC )rank
    From EmpDup
    )T
    where rank>1
    )

    alter table EmpDup 
    drop  column sno
    

Use any one of above 3 methods to delete duplicate rows

Good to know about other scenarios of deleting duplicates:

If you want to delete duplicates of a  particular ID(here empid) then use"Delete Top( )" clause as shown below

    delete top(2) From EmpDup where empid=2
OR
    delete top(select count(*)-1 From EmpDup x where x.empid=2) From EmpDup where empid=2
    

If you want to delete all the rows if the selected columns repeated more than 1 time then use below query.

Query to delete 3 duplicated rows (in our example table) or repeated more than 1 time using "GROUP BY" clause.

       delete from EmpDup where EmpID in(select EmpID from EmpDup group by EmpId having
      count(*) >1)