How to Easily Remove Duplicate Rows from a SQL Server Table

Cleaning up duplicate rows is essential for maintaining data quality. If you’re working with a large SQL Server table (over 300,000 rows, for example), you may find duplicates that you’d like to remove. In this blog post, we’ll walk through a straightforward way to delete them while keeping exactly one copy of each row.

Understanding the Problem

When a table like MyTable has an identity primary key (RowID), duplicate rows are never perfect matches, because each copy gets its own RowID. The duplication is in the non-key columns, here Col1, Col2, and Col3, which hold identical values across the copies. The goal is to keep one row from each set of identical (Col1, Col2, Col3) values and delete the rest.

Example Structure of MyTable

CREATE TABLE MyTable (
   RowID int not null identity(1,1) primary key,
   Col1 varchar(20) not null,
   Col2 varchar(2048) not null,
   Col3 tinyint not null
);
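
If you’d like to try the steps without touching real data, one option is a scratch copy with a few sample rows. The sketch below is only an illustration: the temporary table name #MyTableDemo and the sample values are made up for this example and are not part of the original table.

```sql
-- Hypothetical scratch table with the same columns as MyTable.
CREATE TABLE #MyTableDemo (
    RowID int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    Col1 varchar(20) NOT NULL,
    Col2 varchar(2048) NOT NULL,
    Col3 tinyint NOT NULL
);

-- RowID is filled in by the identity column, so it is left out of the insert.
INSERT INTO #MyTableDemo (Col1, Col2, Col3)
VALUES ('alpha', 'first value',  1),
       ('alpha', 'first value',  1),   -- duplicate of the first row
       ('beta',  'second value', 2),
       ('alpha', 'first value',  1),   -- another duplicate of the first row
       ('beta',  'other value',  2);   -- differs in Col2, so not a duplicate
```

With this data there are three distinct (Col1, Col2, Col3) combinations, so a successful cleanup should leave three rows.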

The Solution Explained

To remove duplicate rows while keeping one copy of each, you can combine a GROUP BY query with a DELETE statement. The steps are broken down below.

Step-by-Step Guide

1. Grouping and Selecting Unique Rows

The first step is to group the rows by the columns that define a duplicate, in this case Col1, Col2, and Col3. The MIN function then returns the smallest RowID in each group, which identifies the row to keep.

Here’s how the SQL code might look:

SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3;
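
Before deleting anything, it can help to see how many duplicates there actually are. A minimal sketch, assuming the same MyTable columns as above (the aliases CopyCount and KeepRowID are names chosen for this example):

```sql
-- List only the groups that occur more than once, largest groups first.
SELECT Col1, Col2, Col3,
       COUNT(*)   AS CopyCount,   -- number of rows sharing these values
       MIN(RowID) AS KeepRowID    -- the row the MIN rule would keep
FROM MyTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) > 1
ORDER BY CopyCount DESC;
```

If this returns no rows, the table has no duplicates on these three columns and there is nothing to delete.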

2. Delete Duplicates

Once you’ve identified which rows to keep, the next step is to delete every row that has no counterpart in that set. The query from step 1 becomes a derived table, aliased KeepRows, and MyTable is joined against it. Here’s the SQL code to perform the deletion:

-- T-SQL needs the target table named before FROM when deleting through a join.
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
   FROM MyTable
   GROUP BY Col1, Col2, Col3
) AS KeepRows ON
   MyTable.RowID = KeepRows.RowID
WHERE
   KeepRows.RowID IS NULL;

This command does the following:

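  • It performs a LEFT OUTER JOIN between MyTable and the derived table KeepRows, so every row of MyTable appears in the join result.
  • Rows whose RowID is not in KeepRows come back with KeepRows.RowID set to NULL. The WHERE clause selects exactly those rows, and they are the ones deleted.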
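
On a large table you may want to check the effect before making it permanent. The following is one possible way to do that, not the only one: it wraps the delete in a transaction, counts the rows the join would remove, runs the delete using the `DELETE MyTable FROM MyTable ... JOIN` form, and then confirms that no duplicate groups remain. The aliases RowsToDelete, RemainingDuplicateGroups, and Dups are names chosen for this sketch.

```sql
BEGIN TRANSACTION;

-- Same join as the DELETE, but only counting the rows it would remove.
SELECT COUNT(*) AS RowsToDelete
FROM MyTable
LEFT OUTER JOIN (
    SELECT MIN(RowID) AS RowID
    FROM MyTable
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON MyTable.RowID = KeepRows.RowID
WHERE KeepRows.RowID IS NULL;

-- Delete every row that is not the lowest RowID in its group.
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
    SELECT MIN(RowID) AS RowID
    FROM MyTable
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON MyTable.RowID = KeepRows.RowID
WHERE KeepRows.RowID IS NULL;

-- After the delete, no group should contain more than one row.
SELECT COUNT(*) AS RemainingDuplicateGroups
FROM (
    SELECT Col1, Col2, Col3
    FROM MyTable
    GROUP BY Col1, Col2, Col3
    HAVING COUNT(*) > 1
) AS Dups;

-- Undo by default; swap in COMMIT TRANSACTION once the numbers look right.
ROLLBACK TRANSACTION;
```

Keep in mind that an open transaction holds locks on the affected rows, so on a busy table it is best not to leave it open for long.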

Handling Unique Identifiers

If your table includes a GUID instead of an integer for row identification, simply adjust your MIN selection. Replace:

MIN(RowId)

With:

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))

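This picks one GUID per group deterministically (the one whose string form sorts first) and converts the result back to uniqueidentifier. String order is not the same as SQL Server’s native uniqueidentifier ordering, but that doesn’t matter here, since any single row per group will do.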
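
Put together, the GUID variant of the delete might look like the sketch below. It assumes MyTable is keyed by a uniqueidentifier column named MyGuidColumn in place of the integer RowID. Whether the char(36) round trip is needed depends on your SQL Server version: to my knowledge, older releases reject MIN on uniqueidentifier with an operand type error, while newer ones accept it directly, so treat the conversion as a portable fallback.

```sql
-- Assumes MyGuidColumn (uniqueidentifier) is the key instead of RowID.
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
    -- Compare GUIDs as 36-character strings, then convert the winner back.
    SELECT CONVERT(uniqueidentifier,
                   MIN(CONVERT(char(36), MyGuidColumn))) AS MyGuidColumn
    FROM MyTable
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON MyTable.MyGuidColumn = KeepRows.MyGuidColumn
WHERE KeepRows.MyGuidColumn IS NULL;
```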

Conclusion

Removing duplicate rows from a SQL Server table can be done efficiently by combining GROUP BY with a join-based DELETE. By following these steps, you keep one copy of every distinct row and remove the rest. Always remember to back up your database before performing mass deletes!

With the knowledge you’ve gained here, you can confidently tackle the issue of duplicates in your SQL tables. Happy querying!