How to delete duplicate rows with SQL

The comments on my article how to find duplicate rows with SQL made me realize I was simplifying things too much. In particular, I really glossed over the “how to delete duplicate rows” section, which I should have explained in more detail. I hope this article will remedy the omissions.

Update: I’ve now written another article on deleting duplicates when you have no primary key.

What’s wrong with what I wrote?

I assumed a beginner audience in my other article. Specifically, I assumed a beginner programmer who designs a table, connects some program to it, and then discovers a bunch of duplicate data because of incorrect table design. The programmer then has to find and delete the duplicate rows before putting unique indexes and primary keys on the data. And because I assumed a relatively small, non-mission-critical task, I suggested you can make a temporary table to aid in deleting the duplicate rows.

This isn’t realistic for a couple of reasons. First, as you showed me, many of you are beyond the simplistic ways to delete duplicates. Second, unless you have the data all to yourself, you need to find and delete the duplicates in an atomic operation, leaving no chance for more duplicates to sneak in while you’re working.

Two ways to do it

There are essentially two ways to do this: in one statement, or in multiple statements. If you do it in one statement, the statement is atomic, and you don’t need a transaction. If you use temporary tables or another method and use multiple statements, you have to do it all in a transaction.

It’s probably easier and better to do it in one statement, contrary to what I suggested previously.

How to delete duplicates with a single statement

The basic technique is to do a grouped self-join or subquery. That will make more sense in a moment.

First, familiarize yourself with the basic techniques I explained for finding the duplicate rows, especially the section on finding the data you need to delete them. This will become the innermost query in your self-join. Here’s the query I’ll use (refer to the other article for details):

select day, MIN(id)
from test
group by day
having count(*) > 1

You cannot delete from that result set, but you can delete by joining against it or using it in a subquery. First, I’ll show you how to self-join against the grouped query.

Technique 1: a grouped self-join

Place the find-bad-rows query into a subquery in the FROM clause, and join against it in such a way that the join will succeed only on rows you don’t want:

select bad_rows.*
from test as bad_rows
   inner join (
      select day, MIN(id) as min_id
      from test
      group by day
      having count(*) > 1
   ) as good_rows on good_rows.day = bad_rows.day
      and good_rows.min_id <> bad_rows.id;

Notice I’m joining on days that match and excluding the row I want to keep, the one with the minimum value for id. If that query returns the rows you don’t want, you’re good to go. All you have to do is put the DELETE in front of it:

delete bad_rows.*
from test as bad_rows
   inner join (
      select day, MIN(id) as min_id
      from test
      group by day
      having count(*) > 1
   ) as good_rows on good_rows.day = bad_rows.day
      and good_rows.min_id <> bad_rows.id;

The syntax will vary slightly depending on your RDBMS. I’ve written this for MySQL (MySQL users might also need to be careful about cross-database deletes). This will also only work on versions of MySQL where subqueries are implemented.

Technique 2: correlated subquery

The second method is to use a correlated subquery and place the find-bad-rows query inside the subquery. You can write these subqueries many different ways. Here’s one rather sub-optimal way:

delete test_outer.*
from test as test_outer
where exists(
   select *
   from test as test_inner
   where test_inner.day = test_outer.day
   group by day
   having count(*) > 1
      and min(test_inner.id) <> test_outer.id
);

This won’t even work on MySQL because it is trying to select from the same table it’s modifying. There are some silly tricks to get around this, which force intermediate materialization of the subquery, but in general you’re better off using the JOIN technique in MySQL.

[Edit: I originally listed an alternate query using a silly trick, which didn’t work (my mistake). I’ve removed that because that kind of query is a Stupid Thing To Do. If you are competent enough to write that query, I don’t need to tell you how. See the comments to follow the conversation about this.]

Technique 3: be clever about your data

Both the previous techniques rely on certain behaviors and database features. How about relying on the data itself to enable a smart, efficient deleting algorithm?

If most groups have duplicates, but there are not many duplicate rows within each group, this can be very efficient and doesn’t require any subqueries:

delete bad_rows.*
from test as good_rows
   inner join test as bad_rows on bad_rows.day = good_rows.day
      and bad_rows.id > good_rows.id;

This works because I decided I wanted to keep the row with the smallest id in each group. That means I can do a self-join that matches rows with a strict greater-than. Greater than what? The minimum value of id for that value of day, of course.

This is essentially a cross join, which is an O(n2) algorithm in the pathological worst case, but if there are not many duplicate rows, it’s basically the same cost as a regular join.

There are cases where this method is really terrible, too. For instance, you have a hundred million rows and only one duplicate row. You’d be joining a hundred million rows against a hundred million rows, eliminating all but one of them, and deleting it. That would be a very bad idea because it would take forever. It could be much more efficient to find the lone duplicate and delete just it, without doing all the joining.

Conclusion

I hope this article filled in some of the gaps in my other article. I’ve shown you at least three techniques for deleting duplicate rows with a single atomic query. Each is good for some scenarios and bad for others.

If you found this useful, you should subscribe by e-mail or feeds and get my upcoming articles conveniently.

Technorati Tags:No Tags

You might also like:

  1. How to find duplicate rows with SQL
  2. How to delete duplicate rows with SQL, Part 2
  3. How to find missing values in a sequence with SQL
  4. How to write a SQL exclusion join
  5. How to simulate FULL OUTER JOIN in MySQL

38 Responses to “How to delete duplicate rows with SQL”


  1. 1 fullbird

    The second options can’t work ok!

    mysql> delete test_outer.*
        -> from testdup as test_outer
        -> where exists(
        ->    select *
        ->    from (
        ->       select * from testdup
        ->       where b = test_outer.b
        ->    ) as test_inner
        ->    group by b
        ->    having count(*) > 1
        ->       and min(test_inner.a) <> test_outer.a
        -> );
    ERROR 1054 (42S22): Unknown column 'test_outer.b' in 'where clause'
  2. 2 Ryan Brooks

    Awesome article! Definately something that should be part of a cheat sheet, or something.

  3. 3 Xaprb

    fullbird, you probably have an earlier version of MySQL than I was using. Subqueries were somewhat awkward when MySQL first introduced them, and they have been refined over time.

  4. 4 fullbird

    xaprb, mysql version is 5.1.11-Beta.

  5. 5 Xaprb

    fullbird, I am getting the same error myself now. I promise I tested every query in this post. I’m trying to figure out what I did wrong. Maybe I accidentally tested a different query than I thought.

  6. 6 Xaprb

    fullbird, I don’t know what I was doing when I wrote the article, but it could not possibly have been the query I listed. It has never been possible to do this in MySQL with a correlated subquery, as explained in the MySQL manual section on subquery limitations:

    Subqueries in the FROM clause cannot be correlated subqueries. They are materialized (executed to produce a result set) before evaluating the outer query, so they cannot be evaluated per row of the outer query.

    I have known for a while that the subquery is materialized in this fashion (see my past articles) but I didn’t know this wouldn’t work. I assumed the subquery would be performed once per row in the outer table, causing one temporary table to be created per row in the outer query. This is why I said it is a stupid thing to do!

    I’m going to edit the article and remove this. It should be possible to rewrite the query other ways so it doesn’t run into this limitation, for example as an IN() subquery, but again it is a much less efficient query, so I think it’s best to leave this alone.

  7. 7 Outta Names

    Great article on fairly complicated queries!

    From a practical standpoint, I’ve been told it might be easier to delete duplicates by creating a new table (call it Table2) with the same design as the existing table (call it Table1), then in Table2, create a unique index on the column you are deduping (or a unique composite index on multiple columns e.g. lastname and phone if you are deduping against a combination of values) and then inserting the data from Table1 into Table2. After checking the data, simply delete Table1 and rename Table2 to Table1.

    That let’s the index do the deduping for you with little fussing over complicatd joins and subqueries.

  8. 8 Xaprb

    Yes, there was recently a thread on the MySQL Internals mailing list where this strategy was discussed too. Especially with certain storage engines, it might be a much more efficient way to do it, as it could cause the server a lot less work and result in a table that’s not fragmented (bad for concurrency in certain workloads).

  9. 9 Yashraj Arora

    if you want to keep one of the rows, in Sybase use delete TOP 1 from tablename where condition. So if you have two similar records, this will delete the first one.

  10. 10 Web developers

    Nice article. It is articles like these that still keep the Mysql open source fire burning.

  11. 11 bharathi

    but i didnt use primark key in my table. so now i want to know how to delete duplicate records without creating new table

  12. 12 Xaprb

    Bharathi, please see my most recent article.

  13. 13 sgtMilburn

    Hi, I tried the query in technique 1 and I’m getting this error:

    #1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘. rid , bad_rows . page AND good_rows . min_id <> bad_rows . id

    All I did was to change my table/column names and add one column in the GROUP BY and AS sections. We are running 4.1.12 of MySQL.

    Here is my whole query:

    SELECT bad_rows.*
    FROM reportData AS bad_rows
        INNER JOIN (
            SELECT rid, page, MIN(id) AS min_id
            FROM reportData
            GROUP BY rid, page
            HAVING count(*) > 1
        ) AS good_rows ON good_rows.rid, good_rows.page = bad_rows.rid, bad_rows.page
          AND good_rows.min_id <> bad_rows.id

    Have I done something wrong or do I need a newer version of MySQL??

    Thanks for your help.

  14. 14 Xaprb

    Hi sgtMilburn, your join clause has a small syntax error. Try this:

    ) AS good_rows ON good_rows.rid = bad_rows.rid
          AND good_rows.page = bad_rows.page
          AND good_rows.min_id <> bad_rows.id
  15. 15 sgtMilburn

    Works great!! Thanks much!

  16. 16 Ken

    Great Article : I was searching for this and new it could be done but the HOW was evading me - I spent HOURS and HOURS trying to get it to work, all with that you need to put all in the group by when you have an aggregate, and the Need to precede with Exists ( yet always I was getting errors.) This is great. I wonder though on the idea to create a new table and then insert records into it. If I change the table to make a column unique (non primary key of course) - when I insert into the table from a table with duplicates - will I get an error since it cant insert but must update , how does that work ?

    Also the size of the table - for a large table - I wonder if it is really as efficient - since you must insert all rows into the new table.. where as you really only need to delete rows from the original table..

  17. 17 Xaprb

    Hi Ken, you could of course get errors. You’d have to unique-ify by all relevant columns so you didn’t get any conflicts. One thing MySQL can do that I’m not aware of in any other system is INSERT IGNORE, so if you’re using MySQL you might look into that.

    As far as efficiency on a large table, that really depends. It depends on how efficient your system is at deletes, and in MySQL’s case, which storage engine you’re using. It depends on whether rows are fixed or variable width, and if you have BLOB columns. It is also heavily dependent on how the table is indexed; a “simple delete” from a table with huge indexes might be much more work than you think, because all the indexes have to be rearranged. Sometimes, as strange as it sounds, it’s faster to build a new table fresh than punch holes in an old table and leave it all fragmented.

  18. 18 Paul

    This article helped me to come near the solution for my problem. It described the topic perfectly for me. But perhaps you can also give me a hint to solve the rest of my problem

    I have a table that includes a list of downloads. As it are the notify messages sent by cellphones this list naturally not only includes success messages. There are a lot of duplicate entries in it. But If a user tried to download more than once I can now find the entries and remove the duplicates thanks to your article. But there are also some “duplicates” in it where one of the entries is a Success message. This one I naturally want to keep.

    Your select, modified to fit my table looks like

    select bad_rows.*
    from cw10 as bad_rows
       inner join (
          select *, MIN(id) as min_id
          from cw10
          group by name, msisdn, fil_id, operator, language, useragent
          having count(*) > 1
       ) as good_rows on good_rows.name = bad_rows.name
          and good_rows.msisdn = bad_rows.msisdn
          and good_rows.language = bad_rows.language
          and good_rows.operator = bad_rows.operator
          and good_rows.useragent = bad_rows.useragent
          and good_rows.fil_id = bad_rows.fil_id
          and good_rows.min_id <> bad_rows.id;

    The success, failures and errors of a download are written into the messages field. And an entry can be identified by the combination of name, msisdn, fil_id, operator, language, useragent

    Is there a way to go through the table and remove any duplicates but if one of the duplicate messages is a success message keep that message.

    Thanks
    Paul

  19. 19 Xaprb

    Hi Paul, I would just put an appropriate WHERE clause everywhere it’s needed. I’m not crystal clear on what your data looks like, but just eliminating success messages from both SELECTs might do it.

  20. 20 Amian

    Thanks Xaprb, that’s an awesome work :)

  21. 21 David

    Thanks! It worked perfectly with MySQL server version 4.1 for my site! You are awesome!

  22. 22 Bata

    Is there any default column in sql server like “Rowid” in Oracle??

  23. 23 Xaprb

    No.

  24. 24 john goodman

    I work with a ton of data. On several reports that I receive and convert to access tables, there is duplicate data in some of the fields within a certain column. for example, there may be more than one account number, but several different charges that are listed on the end of the table. it may look like this:

    123456       xxxxxx       xxxxxx        xxxxxx      1.50
    123456       xxxxxx       xxxxxx        xxxxxx      15.50
    159753       xxxxxx       xxxxxx        xxxxxx      17.50

    And so on…
    I need it to only show the one set of (123456) but with the totals of all the accounts with that number. Please help

  25. 25 Xaprb

    That’s not duplicate data in my world :-) You need to use the GROUP BY clause. For faster help, try the #mysql IRC channel.

  26. 26 Arun Thomas

    It will not work if our table has text, ntext, or image data type
    we wil get an error as:
    The text, ntext, or image data type cannot be selected as DISTINCT.

    so add another identity column..
    then search for replicating data and delete with that new ID..
    It will work definitely..

    regard
    Arun Thomas

  27. 27 Rollo

    Really helpful stuff, thanks!

    But I’m having trouble adapting your Technique 1 for a particular purpose. I would like to eliminate duplicates only if they have a certain value for a given column (in this case “lang=’0′”). (I plan then to do a second pass to get rid of the rest by MIN id.)

    Seems to be a case for a WHERE clause but I just can’t figure out how to construct the query. Ideas appreciated!

  28. 28 satya

    Nice Artical, very usefull to bigners like me

    By,
    Satya

  29. 29 john.mocambique

    Technique #1 worked really well for the problem I was having. I used it to check for the max. Thanks.

  30. 30 Patrick Chevalier

    Thank you so much, this article is fantastic: I had exactly this problem and my solutions were too slow. With your ‘Technique 1′ I have solved my problem very quickly and the answer is very fast
    best regards

    Patrick Chevalier

  31. 31 BC_Man

    There’s probably some reason for having to do this internally to mySQL, but another way is this…

    sort -f -u -t , –key=2,51 cumindex.tab.txt > cumindex_unique.tab.txt

    Export the table, sort removing duplicate keys, import.

    BC_Man

  32. 32 Sam

    Very very helpful, and great discussion on the topic. Thanks!

  33. 33 shailesh

    if you want to remove duplicate records with ntext,text and image datatype .please do the below link stap.

    http://www.codegroups.com/blog/index.php/deleting-duplicate-records-from-table-with-text-ntext-or-image-data-type/

    I hope this is help !

    Regards,
    Shaileshk

  34. 34 IAyame

    Very nice post.
    I am trying to find duplicate entries based on different column in a joined table. In my senerio I have upto 5 columns that has to be identical to determine if the entry is duplicate.
    I am using SQL Server 2005. Any help would be grately appretiated.

  1. 1 zed23 » Blog Archive » How to delete duplicate rows with SQL
  2. 2 How to delete duplicate rows in a MySQL table | my-whiteboard
  3. 3 links for 2007-09-13 « manalang
  4. 4 MajMcDonald.com » My del.icio.us bookmarks for November 20th

Leave a Reply

Please do not use this blog to get help with problems or bugs in Maatkit or innotop: use the Sourceforge forums, mailing list, or bug trackers. If you're asking for help with MySQL, please use the MySQL mailing list instead. I'm writing a book and my time is extremely limited :-)