Xaprb

Stay curious!

How to delete duplicate rows with SQL

with 54 comments

The comments on my article how to find duplicate rows with SQL made me realize I was simplifying things too much. In particular, I really glossed over the “how to delete duplicate rows” section, which I should have explained in more detail. I hope this article will remedy the omissions.

Update: I’ve now written another article on deleting duplicates when you have no primary key.

What’s wrong with what I wrote?

I assumed a beginner audience in my other article. Specifically, I assumed a beginner programmer who designs a table, connects some program to it, and then discovers a bunch of duplicate data because of incorrect table design. The programmer then has to find and delete the duplicate rows before putting unique indexes and primary keys on the data. And because I assumed a relatively small, non-mission-critical task, I suggested you can make a temporary table to aid in deleting the duplicate rows.

This isn’t realistic for a couple of reasons. First, as you showed me, many of you are beyond the simplistic ways to delete duplicates. Second, unless you have the data all to yourself, you need to find and delete the duplicates in an atomic operation, leaving no chance for more duplicates to sneak in while you’re working.

Two ways to do it

There are essentially two ways to do this: in one statement, or in multiple statements. If you do it in one statement, the statement is atomic, and you don’t need a transaction. If you use temporary tables or another method and use multiple statements, you have to do it all in a transaction.

It’s probably easier and better to do it in one statement, contrary to what I suggested previously.

How to delete duplicates with a single statement

The basic technique is to do a grouped self-join or subquery. That will make more sense in a moment.

First, familiarize yourself with the basic techniques I explained for finding the duplicate rows, especially the section on finding the data you need to delete them. This will become the innermost query in your self-join. Here’s the query I’ll use (refer to the other article for details):

select day, MIN(id)
from test
group by day
having count(*) > 1

You cannot delete from that result set, but you can delete by joining against it or using it in a subquery. First, I’ll show you how to self-join against the grouped query.

Technique 1: a grouped self-join

Place the find-bad-rows query into a subquery in the FROM clause, and join against it in such a way that the join will succeed only on rows you don’t want:

select bad_rows.*
from test as bad_rows
   inner join (
      select day, MIN(id) as min_id
      from test
      group by day
      having count(*) > 1
   ) as good_rows on good_rows.day = bad_rows.day
      and good_rows.min_id <> bad_rows.id;

Notice I’m joining on days that match and excluding the row I want to keep, the one with the minimum value for id. If that query returns the rows you don’t want, you’re good to go. All you have to do is put the DELETE in front of it:

delete bad_rows.*
from test as bad_rows
   inner join (
      select day, MIN(id) as min_id
      from test
      group by day
      having count(*) > 1
   ) as good_rows on good_rows.day = bad_rows.day
      and good_rows.min_id <> bad_rows.id;

The syntax will vary slightly depending on your RDBMS. I’ve written this for MySQL (MySQL users might also need to be careful about cross-database deletes). This will also only work on versions of MySQL where subqueries are implemented.

Technique 2: correlated subquery

The second method is to use a correlated subquery and place the find-bad-rows query inside the subquery. You can write these subqueries many different ways. Here’s one rather sub-optimal way:

delete test_outer.*
from test as test_outer
where exists(
   select *
   from test as test_inner
   where test_inner.day = test_outer.day
   group by day
   having count(*) > 1
      and min(test_inner.id) <> test_outer.id
);

This won’t even work on MySQL because it is trying to select from the same table it’s modifying. There are some silly tricks to get around this, which force intermediate materialization of the subquery, but in general you’re better off using the JOIN technique in MySQL.

[Edit: I originally listed an alternate query using a silly trick, which didn't work (my mistake). I've removed that because that kind of query is a Stupid Thing To Do. If you are competent enough to write that query, I don't need to tell you how. See the comments to follow the conversation about this.]

Technique 3: be clever about your data

Both the previous techniques rely on certain behaviors and database features. How about relying on the data itself to enable a smart, efficient deleting algorithm?

If most groups have duplicates, but there are not many duplicate rows within each group, this can be very efficient and doesn’t require any subqueries:

delete bad_rows.*
from test as good_rows
   inner join test as bad_rows on bad_rows.day = good_rows.day
      and bad_rows.id > good_rows.id;

This works because I decided I wanted to keep the row with the smallest id in each group. That means I can do a self-join that matches rows with a strict greater-than. Greater than what? The minimum value of id for that value of day, of course.

This is essentially a cross join, which is an O(n2) algorithm in the pathological worst case, but if there are not many duplicate rows, it’s basically the same cost as a regular join.

There are cases where this method is really terrible, too. For instance, you have a hundred million rows and only one duplicate row. You’d be joining a hundred million rows against a hundred million rows, eliminating all but one of them, and deleting it. That would be a very bad idea because it would take forever. It could be much more efficient to find the lone duplicate and delete just it, without doing all the joining.

Conclusion

I hope this article filled in some of the gaps in my other article. I’ve shown you at least three techniques for deleting duplicate rows with a single atomic query. Each is good for some scenarios and bad for others.

If you found this useful, you should subscribe by e-mail or feeds and get my upcoming articles conveniently.

Written by Xaprb

October 11th, 2006 at 11:31 pm

Posted in SQL

54 Responses to 'How to delete duplicate rows with SQL'

Subscribe to comments with RSS

  1. The second options can’t work ok!

    mysql> delete test_outer.*
        -> from testdup as test_outer
        -> where exists(
        ->    select *
        ->    from (
        ->       select * from testdup
        ->       where b = test_outer.b
        ->    ) as test_inner
        ->    group by b
        ->    having count(*) > 1
        ->       and min(test_inner.a) <> test_outer.a
        -> );
    ERROR 1054 (42S22): Unknown column 'test_outer.b' in 'where clause'

    fullbird

    12 Oct 06 at 4:13 am

  2. Awesome article! Definately something that should be part of a cheat sheet, or something.

    Ryan Brooks

    12 Oct 06 at 4:24 am

  3. [...] how-to-delete-duplicate-rows-with-sql/ [...]

  4. fullbird, you probably have an earlier version of MySQL than I was using. Subqueries were somewhat awkward when MySQL first introduced them, and they have been refined over time.

    Xaprb

    12 Oct 06 at 7:58 am

  5. xaprb, mysql version is 5.1.11-Beta.

    fullbird

    13 Oct 06 at 10:04 pm

  6. fullbird, I am getting the same error myself now. I promise I tested every query in this post. I’m trying to figure out what I did wrong. Maybe I accidentally tested a different query than I thought.

    Xaprb

    14 Oct 06 at 2:20 pm

  7. fullbird, I don’t know what I was doing when I wrote the article, but it could not possibly have been the query I listed. It has never been possible to do this in MySQL with a correlated subquery, as explained in the MySQL manual section on subquery limitations:

    Subqueries in the FROM clause cannot be correlated subqueries. They are materialized (executed to produce a result set) before evaluating the outer query, so they cannot be evaluated per row of the outer query.

    I have known for a while that the subquery is materialized in this fashion (see my past articles) but I didn’t know this wouldn’t work. I assumed the subquery would be performed once per row in the outer table, causing one temporary table to be created per row in the outer query. This is why I said it is a stupid thing to do!

    I’m going to edit the article and remove this. It should be possible to rewrite the query other ways so it doesn’t run into this limitation, for example as an IN() subquery, but again it is a much less efficient query, so I think it’s best to leave this alone.

    Xaprb

    14 Oct 06 at 8:21 pm

  8. Great article on fairly complicated queries!

    From a practical standpoint, I’ve been told it might be easier to delete duplicates by creating a new table (call it Table2) with the same design as the existing table (call it Table1), then in Table2, create a unique index on the column you are deduping (or a unique composite index on multiple columns e.g. lastname and phone if you are deduping against a combination of values) and then inserting the data from Table1 into Table2. After checking the data, simply delete Table1 and rename Table2 to Table1.

    That let’s the index do the deduping for you with little fussing over complicatd joins and subqueries.

    Outta Names

    28 Oct 06 at 12:57 am

  9. Yes, there was recently a thread on the MySQL Internals mailing list where this strategy was discussed too. Especially with certain storage engines, it might be a much more efficient way to do it, as it could cause the server a lot less work and result in a table that’s not fragmented (bad for concurrency in certain workloads).

    Xaprb

    29 Oct 06 at 9:31 am

  10. if you want to keep one of the rows, in Sybase use delete TOP 1 from tablename where condition. So if you have two similar records, this will delete the first one.

    Yashraj Arora

    10 Nov 06 at 5:04 am

  11. Nice article. It is articles like these that still keep the Mysql open source fire burning.

    Web developers

    11 Jan 07 at 3:15 am

  12. but i didnt use primark key in my table. so now i want to know how to delete duplicate records without creating new table

    bharathi

    6 Feb 07 at 1:31 am

  13. Bharathi, please see my most recent article.

    Xaprb

    6 Feb 07 at 9:07 am

  14. Hi, I tried the query in technique 1 and I’m getting this error:

    #1064 – You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘. rid , bad_rows . page AND good_rows . min_id <> bad_rows . id

    All I did was to change my table/column names and add one column in the GROUP BY and AS sections. We are running 4.1.12 of MySQL.

    Here is my whole query:

    SELECT bad_rows.*
    FROM reportData AS bad_rows
        INNER JOIN (
            SELECT rid, page, MIN(id) AS min_id
            FROM reportData
            GROUP BY rid, page
            HAVING count(*) > 1
        ) AS good_rows ON good_rows.rid, good_rows.page = bad_rows.rid, bad_rows.page
          AND good_rows.min_id <> bad_rows.id

    Have I done something wrong or do I need a newer version of MySQL??

    Thanks for your help.

    sgtMilburn

    28 Feb 07 at 2:31 pm

  15. Hi sgtMilburn, your join clause has a small syntax error. Try this:

    ) AS good_rows ON good_rows.rid = bad_rows.rid
          AND good_rows.page = bad_rows.page
          AND good_rows.min_id <> bad_rows.id

    Xaprb

    28 Feb 07 at 3:19 pm

  16. Works great!! Thanks much!

    sgtMilburn

    28 Feb 07 at 7:09 pm

  17. Great Article : I was searching for this and new it could be done but the HOW was evading me – I spent HOURS and HOURS trying to get it to work, all with that you need to put all in the group by when you have an aggregate, and the Need to precede with Exists ( yet always I was getting errors.) This is great. I wonder though on the idea to create a new table and then insert records into it. If I change the table to make a column unique (non primary key of course) – when I insert into the table from a table with duplicates – will I get an error since it cant insert but must update , how does that work ?

    Also the size of the table – for a large table – I wonder if it is really as efficient – since you must insert all rows into the new table.. where as you really only need to delete rows from the original table..

    Ken

    8 Mar 07 at 2:39 am

  18. Hi Ken, you could of course get errors. You’d have to unique-ify by all relevant columns so you didn’t get any conflicts. One thing MySQL can do that I’m not aware of in any other system is INSERT IGNORE, so if you’re using MySQL you might look into that.

    As far as efficiency on a large table, that really depends. It depends on how efficient your system is at deletes, and in MySQL’s case, which storage engine you’re using. It depends on whether rows are fixed or variable width, and if you have BLOB columns. It is also heavily dependent on how the table is indexed; a “simple delete” from a table with huge indexes might be much more work than you think, because all the indexes have to be rearranged. Sometimes, as strange as it sounds, it’s faster to build a new table fresh than punch holes in an old table and leave it all fragmented.

    Xaprb

    8 Mar 07 at 8:54 am

  19. This article helped me to come near the solution for my problem. It described the topic perfectly for me. But perhaps you can also give me a hint to solve the rest of my problem

    I have a table that includes a list of downloads. As it are the notify messages sent by cellphones this list naturally not only includes success messages. There are a lot of duplicate entries in it. But If a user tried to download more than once I can now find the entries and remove the duplicates thanks to your article. But there are also some “duplicates” in it where one of the entries is a Success message. This one I naturally want to keep.

    Your select, modified to fit my table looks like

    select bad_rows.*
    from cw10 as bad_rows
       inner join (
          select *, MIN(id) as min_id
          from cw10
          group by name, msisdn, fil_id, operator, language, useragent
          having count(*) > 1
       ) as good_rows on good_rows.name = bad_rows.name
          and good_rows.msisdn = bad_rows.msisdn
          and good_rows.language = bad_rows.language
          and good_rows.operator = bad_rows.operator
          and good_rows.useragent = bad_rows.useragent
          and good_rows.fil_id = bad_rows.fil_id
          and good_rows.min_id <> bad_rows.id;

    The success, failures and errors of a download are written into the messages field. And an entry can be identified by the combination of name, msisdn, fil_id, operator, language, useragent

    Is there a way to go through the table and remove any duplicates but if one of the duplicate messages is a success message keep that message.

    Thanks
    Paul

    Paul

    11 Mar 07 at 5:08 am

  20. Hi Paul, I would just put an appropriate WHERE clause everywhere it’s needed. I’m not crystal clear on what your data looks like, but just eliminating success messages from both SELECTs might do it.

    Xaprb

    12 Mar 07 at 11:45 am

  21. Thanks Xaprb, that’s an awesome work :)

    Amian

    13 Mar 07 at 6:28 pm

  22. Thanks! It worked perfectly with MySQL server version 4.1 for my site! You are awesome!

    David

    24 Mar 07 at 11:50 pm

  23. [...] There is another good article about deleting duplicate rows. Keywords: none [...]

  24. Is there any default column in sql server like “Rowid” in Oracle??

    Bata

    10 Apr 07 at 6:31 am

  25. No.

    Xaprb

    10 Apr 07 at 7:38 am

  26. I work with a ton of data. On several reports that I receive and convert to access tables, there is duplicate data in some of the fields within a certain column. for example, there may be more than one account number, but several different charges that are listed on the end of the table. it may look like this:

    123456       xxxxxx       xxxxxx        xxxxxx      1.50
    123456       xxxxxx       xxxxxx        xxxxxx      15.50
    159753       xxxxxx       xxxxxx        xxxxxx      17.50

    And so on…
    I need it to only show the one set of (123456) but with the totals of all the accounts with that number. Please help

    john goodman

    11 Apr 07 at 8:49 am

  27. That’s not duplicate data in my world :-) You need to use the GROUP BY clause. For faster help, try the #mysql IRC channel.

    Xaprb

    11 Apr 07 at 2:20 pm

  28. It will not work if our table has text, ntext, or image data type
    we wil get an error as:
    The text, ntext, or image data type cannot be selected as DISTINCT.

    so add another identity column..
    then search for replicating data and delete with that new ID..
    It will work definitely..

    regard
    Arun Thomas

    Arun Thomas

    11 Jul 07 at 2:54 am

  29. Really helpful stuff, thanks!

    But I’m having trouble adapting your Technique 1 for a particular purpose. I would like to eliminate duplicates only if they have a certain value for a given column (in this case “lang=’0′”). (I plan then to do a second pass to get rid of the rest by MIN id.)

    Seems to be a case for a WHERE clause but I just can’t figure out how to construct the query. Ideas appreciated!

    Rollo

    24 Jul 07 at 9:23 pm

  30. Nice Artical, very usefull to bigners like me

    By,
    Satya

    satya

    30 Jul 07 at 8:33 am

  31. Technique #1 worked really well for the problem I was having. I used it to check for the max. Thanks.

    john.mocambique

    15 Aug 07 at 11:53 am

  32. [...] How to delete duplicate rows with SQL at Xaprb excellent reference on how to delete dup rows in sql (tags: mysql database howto sql duplicates tips) [...]

  33. Thank you so much, this article is fantastic: I had exactly this problem and my solutions were too slow. With your ‘Technique 1′ I have solved my problem very quickly and the answer is very fast
    best regards

    Patrick Chevalier

    Patrick Chevalier

    21 Oct 07 at 4:01 am

  34. There’s probably some reason for having to do this internally to mySQL, but another way is this…

    sort -f -u -t , –key=2,51 cumindex.tab.txt > cumindex_unique.tab.txt

    Export the table, sort removing duplicate keys, import.

    BC_Man

    BC_Man

    1 Nov 07 at 8:03 pm

  35. [...] How to delete duplicate rows with SQL at Xaprb – [...]

  36. Very very helpful, and great discussion on the topic. Thanks!

    Sam

    28 Nov 07 at 6:39 am

  37. if you want to remove duplicate records with ntext,text and image datatype .please do the below link stap.

    http://www.codegroups.com/blog/index.php/deleting-duplicate-records-from-table-with-text-ntext-or-image-data-type/

    I hope this is help !

    Regards,
    Shaileshk

    shailesh

    11 Apr 08 at 4:27 am

  38. Very nice post.
    I am trying to find duplicate entries based on different column in a joined table. In my senerio I have upto 5 columns that has to be identical to determine if the entry is duplicate.
    I am using SQL Server 2005. Any help would be grately appretiated.

    IAyame

    8 May 08 at 5:23 pm

  39. super stuff got me out of a jam – was using a select top 1 querying the same table twice with two aliases – the technique 1 is excellent

    TheBigMG

    7 Aug 08 at 6:04 am

  40. works for me. excellent. thanks!

    Mark

    14 Oct 08 at 8:02 am

  41. just for the record my version is: MySQL 6.0.4-alpha community

    Mark

    14 Oct 08 at 8:04 am

  42. Hi, how do i list all my salesman-customers pairs who made the largest number of transactions together, whereby the customer has never complain against the salesman? That is, i have a lot of pairs, but i want to select all sellers to be distinct with their most popular customer

    noobsql

    15 Oct 08 at 4:34 am

  43. Technique 1 rocks. Thanks!

    This is a great blog for developers

  44. Dude, you’re a lifesaver – thanks a ton for this great article!

    Renie

    27 Oct 08 at 2:27 am

  45. Thank’s for the post

    hackus

    27 Oct 08 at 7:00 am

  46. I attempted to use this on a database that I manage, and I received the following error:

    #1064 – You have an error in your SQL syntax. Check the manual that corresponds to your MySQL server version for the right syntax to use near ’select ip, MIN(id) as min_id
    from sab_response
    grou

    The query that I have is:
    select bad_rows.*
    from sab_response as bad_rows
    inner join (
    select ip, MIN(id) as min_id
    from sab_response
    group by ip
    having count(*) > 1
    ) as good_rows on good_rows.ip = bad_rows.ip
    and good_rows.min_id bad_rows.id

    The queries work individually, so I am lead to believe that the problem lies in the join. The server is running MySQL 4.1.22.

    Thanks!

    rbright

    7 Nov 08 at 10:39 am

  47. [...] More information on this deletion query and other methods of deleting duplicates can be found at Xaprb.com. [...]

  48. yogesh

    25 Apr 09 at 4:58 am

  49. Thank you so much for the valuable information on removing redundancy from a table. Please follow the link below to get more information on Deleting duplicate records from a table efficiently.
    http://www.sqllion.com/2009/05/delete-duplicate-records/

    sql lion

    28 May 09 at 2:56 pm

  50. [...] information on this deletion query and other methods of deleting duplicates can be found at Xaprb.com.Written by Philip [...]

  51. This article is very impressive, superb!

    Roy Bellingan

    24 Dec 09 at 4:51 pm

  52. Thank you so much as you have save my life, i was searching….searching…searching to find how to remove duplicate. you way the best. keep up the good work

    timor

    15 Mar 10 at 9:32 am

  53. Hi. I’m adding my “thanks” to the “thank_you row”.

    jonez1

    18 Apr 10 at 3:26 pm

  54. Perfect – technique 3 worked for me! Thanks for the post!

    Chris

    21 May 10 at 7:12 am

Leave a Reply