Sunday, July 31, 2011

Delete duplicate rows in SQL Server

I had a difficult time researching how to delete duplicate rows in SQL Server when I encountered a customer database with masses (I mean millions) of records that were duplicates of each other in every way. This database was being exported to another vendor, who were simply going to export the contents to PDF charts, and they did not want to have to deal with the dupes. However these records were created, I had to get rid of all but one of each unique combination.

It turns out to be very easy to delete duplicates and still retain all the unique records that you want. Furthermore, the duplication conditions can be any combination of fields you want, just so long as you have at least one field that is always unique, such as an auto-incrementing counter or a time stamp.

Suppose the table name in question is "books" (Anyone remember Northwind Publishing?).  Let's say the fields that are duplicated are Author, Title, and Published Date. Any unique combination of these (by our own definition) represents one book; any other records with that same combination would be considered to be duplicates, and we would want to get rid of them and leave the just the first one in the table.

Lastly, let us assume that there is a unique auto-incrementing counter field, called "BookID". Here is how you would rid yourself of the pesky duplicates, leaving one of each unique record:

Delete t1
from [books] t1, [books] t2
where t1.[author] = t2.[author]
and t1.[title] = t2.[title]
and t1.[published date] = t2.[published date]
and t1.[bookid] > t2.[bookid]

In this SQL statement we have referenced the same table name twice, but by using aliases (t1 and t2) we can separate the references and use equality operators to select and delete duplicates. Note that the last clause uses a greater than operator (>) rather than equals (=), which will leave the record with the lowest "Bookid" value for each unique combo of the other fields. You can reverse this if you want to keep the highest value.

You can use time stamps in place of unique id fields, but there is sometimes a danger of these being identical, so you may miss deleting some of the records. It might be better to add a temporary field with a unique counter instead, and then delete the field after the deletion is complete.

An interesting thing to note is that if you replace the "delete t1 from" with "select * from", you will potentially have more rows returned than would be deleted. This is because if you have more than three identical records, one of them will be displayed twice: once in comparison to record one, once in comparison to record two. Record two would be displayed in comparison to record one, for a total of three records displayed. However, only two records would be deleted.

Labels: , , , ,