How to delete duplicate rows with SQL
The comments on my article how to find duplicate rows with SQL made me realize I was simplifying things too much. In particular, I really glossed over the “how to delete duplicate rows” section, which I should have explained in more detail. I hope this article will remedy the omissions.
Update: I’ve now written another article on deleting duplicates when you have no primary key.
What’s wrong with what I wrote?
I assumed a beginner audience in my other article. Specifically, I assumed a beginner programmer who designs a table, connects some program to it, and then discovers a bunch of duplicate data because of incorrect table design. The programmer then has to find and delete the duplicate rows before putting unique indexes and primary keys on the data. And because I assumed a relatively small, non-mission-critical task, I suggested you can make a temporary table to aid in deleting the duplicate rows.
This isn’t realistic for a couple of reasons. First, as you showed me, many of you are beyond the simplistic ways to delete duplicates. Second, unless you have the data all to yourself, you need to find and delete the duplicates in an atomic operation, leaving no chance for more duplicates to sneak in while you’re working.
Two ways to do it
There are essentially two ways to do this: in one statement, or in multiple statements. If you do it in one statement, the statement is atomic, and you don’t need a transaction. If you use temporary tables or another method and use multiple statements, you have to do it all in a transaction.
It’s probably easier and better to do it in one statement, contrary to what I suggested previously.
How to delete duplicates with a single statement
The basic technique is to do a grouped self-join or subquery. That will make more sense in a moment.
First, familiarize yourself with the basic techniques I explained for finding the duplicate rows, especially the section on finding the data you need to delete them. This will become the innermost query in your self-join. Here’s the query I’ll use (refer to the other article for details):
select day, MIN(id) from test group by day having count(*) > 1
You cannot delete from that result set, but you can delete by joining against it or using it in a subquery. First, I’ll show you how to self-join against the grouped query.
Technique 1: a grouped self-join
Place the find-bad-rows query into a subquery in the FROM clause, and join against it in such a way that the join will succeed only on rows you don’t want:
select bad_rows.*
from test as bad_rows
inner join (
select day, MIN(id) as min_id
from test
group by day
having count(*) > 1
) as good_rows on good_rows.day = bad_rows.day
and good_rows.min_id <> bad_rows.id;
Notice I’m joining on days that match and excluding the row I want to keep, the one with the minimum value for id. If that query returns the rows you don’t want, you’re good to go. All you have to do is put the DELETE in front of it:
delete bad_rows.*
from test as bad_rows
inner join (
select day, MIN(id) as min_id
from test
group by day
having count(*) > 1
) as good_rows on good_rows.day = bad_rows.day
and good_rows.min_id <> bad_rows.id;
The syntax will vary slightly depending on your RDBMS. I’ve written this for MySQL (MySQL users might also need to be careful about cross-database deletes). This will also only work on versions of MySQL where subqueries are implemented.
Technique 2: correlated subquery
The second method is to use a correlated subquery and place the find-bad-rows query inside the subquery. You can write these subqueries many different ways. Here’s one rather sub-optimal way:
delete test_outer.*
from test as test_outer
where exists(
select *
from test as test_inner
where test_inner.day = test_outer.day
group by day
having count(*) > 1
and min(test_inner.id) <> test_outer.id
);
This won’t even work on MySQL because it is trying to select from the same table it’s modifying. There are some silly tricks to get around this, which force intermediate materialization of the subquery, but in general you’re better off using the JOIN technique in MySQL.
[Edit: I originally listed an alternate query using a silly trick, which didn't work (my mistake). I've removed that because that kind of query is a Stupid Thing To Do. If you are competent enough to write that query, I don't need to tell you how. See the comments to follow the conversation about this.]
Technique 3: be clever about your data
Both the previous techniques rely on certain behaviors and database features. How about relying on the data itself to enable a smart, efficient deleting algorithm?
If most groups have duplicates, but there are not many duplicate rows within each group, this can be very efficient and doesn’t require any subqueries:
delete bad_rows.*
from test as good_rows
inner join test as bad_rows on bad_rows.day = good_rows.day
and bad_rows.id > good_rows.id;
This works because I decided I wanted to keep the row with the smallest id in each group. That means I can do a self-join that matches rows with a strict greater-than. Greater than what? The minimum value of id for that value of day, of course.
This is essentially a cross join, which is an O(n2) algorithm in the pathological worst case, but if there are not many duplicate rows, it’s basically the same cost as a regular join.
There are cases where this method is really terrible, too. For instance, you have a hundred million rows and only one duplicate row. You’d be joining a hundred million rows against a hundred million rows, eliminating all but one of them, and deleting it. That would be a very bad idea because it would take forever. It could be much more efficient to find the lone duplicate and delete just it, without doing all the joining.
Conclusion
I hope this article filled in some of the gaps in my other article. I’ve shown you at least three techniques for deleting duplicate rows with a single atomic query. Each is good for some scenarios and bad for others.
If you found this useful, you should subscribe by e-mail or feeds and get my upcoming articles conveniently.



The second options can’t work ok!
mysql> delete test_outer.* -> from testdup as test_outer -> where exists( -> select * -> from ( -> select * from testdup -> where b = test_outer.b -> ) as test_inner -> group by b -> having count(*) > 1 -> and min(test_inner.a) <> test_outer.a -> ); ERROR 1054 (42S22): Unknown column 'test_outer.b' in 'where clause'fullbird
12 Oct 06 at 4:13 am
Awesome article! Definately something that should be part of a cheat sheet, or something.
Ryan Brooks
12 Oct 06 at 4:24 am
[...] how-to-delete-duplicate-rows-with-sql/ [...]
zed23 » Blog Archive » How to delete duplicate rows with SQL
12 Oct 06 at 4:26 am
fullbird, you probably have an earlier version of MySQL than I was using. Subqueries were somewhat awkward when MySQL first introduced them, and they have been refined over time.
Xaprb
12 Oct 06 at 7:58 am
xaprb, mysql version is 5.1.11-Beta.
fullbird
13 Oct 06 at 10:04 pm
fullbird, I am getting the same error myself now. I promise I tested every query in this post. I’m trying to figure out what I did wrong. Maybe I accidentally tested a different query than I thought.
Xaprb
14 Oct 06 at 2:20 pm
fullbird, I don’t know what I was doing when I wrote the article, but it could not possibly have been the query I listed. It has never been possible to do this in MySQL with a correlated subquery, as explained in the MySQL manual section on subquery limitations:
I have known for a while that the subquery is materialized in this fashion (see my past articles) but I didn’t know this wouldn’t work. I assumed the subquery would be performed once per row in the outer table, causing one temporary table to be created per row in the outer query. This is why I said it is a stupid thing to do!
I’m going to edit the article and remove this. It should be possible to rewrite the query other ways so it doesn’t run into this limitation, for example as an
IN()subquery, but again it is a much less efficient query, so I think it’s best to leave this alone.Xaprb
14 Oct 06 at 8:21 pm
Great article on fairly complicated queries!
From a practical standpoint, I’ve been told it might be easier to delete duplicates by creating a new table (call it Table2) with the same design as the existing table (call it Table1), then in Table2, create a unique index on the column you are deduping (or a unique composite index on multiple columns e.g. lastname and phone if you are deduping against a combination of values) and then inserting the data from Table1 into Table2. After checking the data, simply delete Table1 and rename Table2 to Table1.
That let’s the index do the deduping for you with little fussing over complicatd joins and subqueries.
Outta Names
28 Oct 06 at 12:57 am
Yes, there was recently a thread on the MySQL Internals mailing list where this strategy was discussed too. Especially with certain storage engines, it might be a much more efficient way to do it, as it could cause the server a lot less work and result in a table that’s not fragmented (bad for concurrency in certain workloads).
Xaprb
29 Oct 06 at 9:31 am
if you want to keep one of the rows, in Sybase use
delete TOP 1 from tablename where condition. So if you have two similar records, this will delete the first one.Yashraj Arora
10 Nov 06 at 5:04 am
Nice article. It is articles like these that still keep the Mysql open source fire burning.
Web developers
11 Jan 07 at 3:15 am
but i didnt use primark key in my table. so now i want to know how to delete duplicate records without creating new table
bharathi
6 Feb 07 at 1:31 am
Bharathi, please see my most recent article.
Xaprb
6 Feb 07 at 9:07 am
Hi, I tried the query in technique 1 and I’m getting this error:
#1064 – You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘. rid , bad_rows . page AND good_rows . min_id <> bad_rows . id
All I did was to change my table/column names and add one column in the GROUP BY and AS sections. We are running 4.1.12 of MySQL.
Here is my whole query:
SELECT bad_rows.* FROM reportData AS bad_rows INNER JOIN ( SELECT rid, page, MIN(id) AS min_id FROM reportData GROUP BY rid, page HAVING count(*) > 1 ) AS good_rows ON good_rows.rid, good_rows.page = bad_rows.rid, bad_rows.page AND good_rows.min_id <> bad_rows.idHave I done something wrong or do I need a newer version of MySQL??
Thanks for your help.
sgtMilburn
28 Feb 07 at 2:31 pm
Hi sgtMilburn, your join clause has a small syntax error. Try this:
) AS good_rows ON good_rows.rid = bad_rows.rid AND good_rows.page = bad_rows.page AND good_rows.min_id <> bad_rows.idXaprb
28 Feb 07 at 3:19 pm
Works great!! Thanks much!
sgtMilburn
28 Feb 07 at 7:09 pm
Great Article : I was searching for this and new it could be done but the HOW was evading me – I spent HOURS and HOURS trying to get it to work, all with that you need to put all in the group by when you have an aggregate, and the Need to precede with Exists ( yet always I was getting errors.) This is great. I wonder though on the idea to create a new table and then insert records into it. If I change the table to make a column unique (non primary key of course) – when I insert into the table from a table with duplicates – will I get an error since it cant insert but must update , how does that work ?
Also the size of the table – for a large table – I wonder if it is really as efficient – since you must insert all rows into the new table.. where as you really only need to delete rows from the original table..
Ken
8 Mar 07 at 2:39 am
Hi Ken, you could of course get errors. You’d have to unique-ify by all relevant columns so you didn’t get any conflicts. One thing MySQL can do that I’m not aware of in any other system is INSERT IGNORE, so if you’re using MySQL you might look into that.
As far as efficiency on a large table, that really depends. It depends on how efficient your system is at deletes, and in MySQL’s case, which storage engine you’re using. It depends on whether rows are fixed or variable width, and if you have BLOB columns. It is also heavily dependent on how the table is indexed; a “simple delete” from a table with huge indexes might be much more work than you think, because all the indexes have to be rearranged. Sometimes, as strange as it sounds, it’s faster to build a new table fresh than punch holes in an old table and leave it all fragmented.
Xaprb
8 Mar 07 at 8:54 am
This article helped me to come near the solution for my problem. It described the topic perfectly for me. But perhaps you can also give me a hint to solve the rest of my problem
I have a table that includes a list of downloads. As it are the notify messages sent by cellphones this list naturally not only includes success messages. There are a lot of duplicate entries in it. But If a user tried to download more than once I can now find the entries and remove the duplicates thanks to your article. But there are also some “duplicates” in it where one of the entries is a Success message. This one I naturally want to keep.
Your select, modified to fit my table looks like
select bad_rows.* from cw10 as bad_rows inner join ( select *, MIN(id) as min_id from cw10 group by name, msisdn, fil_id, operator, language, useragent having count(*) > 1 ) as good_rows on good_rows.name = bad_rows.name and good_rows.msisdn = bad_rows.msisdn and good_rows.language = bad_rows.language and good_rows.operator = bad_rows.operator and good_rows.useragent = bad_rows.useragent and good_rows.fil_id = bad_rows.fil_id and good_rows.min_id <> bad_rows.id;The success, failures and errors of a download are written into the messages field. And an entry can be identified by the combination of name, msisdn, fil_id, operator, language, useragent
Is there a way to go through the table and remove any duplicates but if one of the duplicate messages is a success message keep that message.
Thanks
Paul
Paul
11 Mar 07 at 5:08 am
Hi Paul, I would just put an appropriate WHERE clause everywhere it’s needed. I’m not crystal clear on what your data looks like, but just eliminating success messages from both SELECTs might do it.
Xaprb
12 Mar 07 at 11:45 am
Thanks Xaprb, that’s an awesome work :)
Amian
13 Mar 07 at 6:28 pm
Thanks! It worked perfectly with MySQL server version 4.1 for my site! You are awesome!
David
24 Mar 07 at 11:50 pm
[...] There is another good article about deleting duplicate rows. Keywords: none [...]
How to delete duplicate rows in a MySQL table | my-whiteboard
3 Apr 07 at 3:04 pm
Is there any default column in sql server like “Rowid” in Oracle??
Bata
10 Apr 07 at 6:31 am
No.
Xaprb
10 Apr 07 at 7:38 am
I work with a ton of data. On several reports that I receive and convert to access tables, there is duplicate data in some of the fields within a certain column. for example, there may be more than one account number, but several different charges that are listed on the end of the table. it may look like this:
And so on…
I need it to only show the one set of (123456) but with the totals of all the accounts with that number. Please help
john goodman
11 Apr 07 at 8:49 am
That’s not duplicate data in my world :-) You need to use the GROUP BY clause. For faster help, try the #mysql IRC channel.
Xaprb
11 Apr 07 at 2:20 pm
It will not work if our table has text, ntext, or image data type
we wil get an error as:
The text, ntext, or image data type cannot be selected as DISTINCT.
so add another identity column..
then search for replicating data and delete with that new ID..
It will work definitely..
regard
Arun Thomas
Arun Thomas
11 Jul 07 at 2:54 am
Really helpful stuff, thanks!
But I’m having trouble adapting your Technique 1 for a particular purpose. I would like to eliminate duplicates only if they have a certain value for a given column (in this case “lang=’0′”). (I plan then to do a second pass to get rid of the rest by MIN id.)
Seems to be a case for a WHERE clause but I just can’t figure out how to construct the query. Ideas appreciated!
Rollo
24 Jul 07 at 9:23 pm
Nice Artical, very usefull to bigners like me
By,
Satya
satya
30 Jul 07 at 8:33 am
Technique #1 worked really well for the problem I was having. I used it to check for the max. Thanks.
john.mocambique
15 Aug 07 at 11:53 am
[...] How to delete duplicate rows with SQL at Xaprb excellent reference on how to delete dup rows in sql (tags: mysql database howto sql duplicates tips) [...]
links for 2007-09-13 « manalang
13 Sep 07 at 12:23 am
Thank you so much, this article is fantastic: I had exactly this problem and my solutions were too slow. With your ‘Technique 1′ I have solved my problem very quickly and the answer is very fast
best regards
Patrick Chevalier
Patrick Chevalier
21 Oct 07 at 4:01 am
There’s probably some reason for having to do this internally to mySQL, but another way is this…
sort -f -u -t , –key=2,51 cumindex.tab.txt > cumindex_unique.tab.txt
Export the table, sort removing duplicate keys, import.
BC_Man
BC_Man
1 Nov 07 at 8:03 pm
[...] How to delete duplicate rows with SQL at Xaprb – [...]
MajMcDonald.com » My del.icio.us bookmarks for November 20th
26 Nov 07 at 11:56 am
Very very helpful, and great discussion on the topic. Thanks!
Sam
28 Nov 07 at 6:39 am
if you want to remove duplicate records with ntext,text and image datatype .please do the below link stap.
http://www.codegroups.com/blog/index.php/deleting-duplicate-records-from-table-with-text-ntext-or-image-data-type/
I hope this is help !
Regards,
Shaileshk
shailesh
11 Apr 08 at 4:27 am
Very nice post.
I am trying to find duplicate entries based on different column in a joined table. In my senerio I have upto 5 columns that has to be identical to determine if the entry is duplicate.
I am using SQL Server 2005. Any help would be grately appretiated.
IAyame
8 May 08 at 5:23 pm
super stuff got me out of a jam – was using a select top 1 querying the same table twice with two aliases – the technique 1 is excellent
TheBigMG
7 Aug 08 at 6:04 am
works for me. excellent. thanks!
Mark
14 Oct 08 at 8:02 am
just for the record my version is: MySQL 6.0.4-alpha community
Mark
14 Oct 08 at 8:04 am
Hi, how do i list all my salesman-customers pairs who made the largest number of transactions together, whereby the customer has never complain against the salesman? That is, i have a lot of pairs, but i want to select all sellers to be distinct with their most popular customer
noobsql
15 Oct 08 at 4:34 am
Technique 1 rocks. Thanks!
This is a great blog for developers
Matias Alejo Garcia
20 Oct 08 at 10:11 am
Dude, you’re a lifesaver – thanks a ton for this great article!
Renie
27 Oct 08 at 2:27 am
Thank’s for the post
hackus
27 Oct 08 at 7:00 am
I attempted to use this on a database that I manage, and I received the following error:
#1064 – You have an error in your SQL syntax. Check the manual that corresponds to your MySQL server version for the right syntax to use near ’select ip, MIN(id) as min_id
from sab_response
grou
The query that I have is:
select bad_rows.*
from sab_response as bad_rows
inner join (
select ip, MIN(id) as min_id
from sab_response
group by ip
having count(*) > 1
) as good_rows on good_rows.ip = bad_rows.ip
and good_rows.min_id bad_rows.id
The queries work individually, so I am lead to believe that the problem lies in the join. The server is running MySQL 4.1.22.
Thanks!
rbright
7 Nov 08 at 10:39 am
[...] More information on this deletion query and other methods of deleting duplicates can be found at Xaprb.com. [...]
Deleting dublicated rows in MySQL | Ersin Acar
9 Apr 09 at 5:31 am
Delete duplicate records in Sql Server 2005 helped me out
yogesh
25 Apr 09 at 4:58 am
Thank you so much for the valuable information on removing redundancy from a table. Please follow the link below to get more information on Deleting duplicate records from a table efficiently.
http://www.sqllion.com/2009/05/delete-duplicate-records/
sql lion
28 May 09 at 2:56 pm
[...] information on this deletion query and other methods of deleting duplicates can be found at Xaprb.com.Written by Philip [...]
Finding Duplicate Values In A MySQL Table | #! code
3 Sep 09 at 1:08 pm
This article is very impressive, superb!
Roy Bellingan
24 Dec 09 at 4:51 pm
Thank you so much as you have save my life, i was searching….searching…searching to find how to remove duplicate. you way the best. keep up the good work
timor
15 Mar 10 at 9:32 am
Hi. I’m adding my “thanks” to the “thank_you row”.
jonez1
18 Apr 10 at 3:26 pm
Perfect – technique 3 worked for me! Thanks for the post!
Chris
21 May 10 at 7:12 am