Xaprb

Stay curious!

How to delete duplicate rows with SQL, Part 2

with 17 comments

By reader request, this article explains ways to remove duplicate rows when they are completely identical, and you don’t have a primary key or other criterion to identify which rows to “save.”

This is a special case of deleting duplicates. I’ve written another article about the more general case, so I assume you have the background it gives. If not, you should probably go read my article about how to delete duplicate rows in SQL.

Introduction

In general, this is a hard problem. Suppose you have the following data, and you want to delete everything but the first row of its type (you don’t care which, because all duplicate rows are completely identical).

When you’re done, you want just two rows in the table:

Why this is hard

This is hard because there is no way to do this in standard SQL (correct me if I’m wrong). SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set. That’s why SQL doesn’t give you tools to solve this problem.

No database product is truly relational, so in real life it’s possible for duplicates to occur. When it happens, you will have to resort to platform-specific methods to solve it. There should always be a way to do it, because there is always a difference between apparently identical rows. It might be an internal row ID, for example (as in Oracle). If nothing else, the rows have different memory and disk locations in the computer.

The easy way

The easiest thing to do is add a column with a unique number. This is called something different on every platform: it’s an IDENTITY column in SQL Server, an AUTO_INCREMENT column in MySQL, a SERIAL in PostgreSQL, and so on. Look at your platform’s documentation for instructions how to do it.

Once you’ve done that, you’re on easy street. Now go read my previous article to do the actual deleting.

If that won’t do…

Build a new table with distinct values from the old table, then drop and rename:

CREATE TABLE new_fruits ...;

INSERT INTO new_fruits(fruit)
   SELECT DISTINCT fruit FROM fruits;

DROP TABLE fruits;

RENAME TABLE new_fruits fruits;

If you can’t do that…

Perhaps you simply can’t do either of the above. Maybe your table is too large, for example. In that case you’re going to have to use some sort of iterative technique to do it; loop through the rows one at a time and delete every row you see more than once. This is also going to be a platform-specific solution; you may need to use a WHILE loop or server-side cursor. Consult your platform’s documentation for more; I can’t possibly cover all the bases here.

Two examples for MySQL

Here’s a quick technique that uses advanced user-variable techniques on MySQL to delete the rows. MySQL’s server-side cursors are read-only, so some other technique has to be used. User-variables can do the trick, if you write the statement just right — it’s very touchy.

set @num := 0, @type := '';

delete from fruits
where greatest(0,
   @num := if(type = @type, @num + 1, 0),
   least(0, length(@type := type))) > 1
order by type;

If you don’t understand that, go read the article :-) This can be very efficient because it doesn’t require any GROUP BY clause. If your rows are “naturally ordered” with all the duplicates adjacent to each other, you can even omit the ORDER BY clause (if your rows aren’t “sorted naturally,” you will miss some duplicate rows).

The other obvious option is to repeatedly identify a duplicated row, find how many times it’s duplicated, and delete one less than that many rows. You will need to either do this in a stored routine, or get help from some programming language. For example, in pseudo-code:

set @num := 0;

select @type := type, @num := count(*)
   from fruits
   group by type
   having count(*) > 1
   limit 1;

while @num > 0

   delete from fruits where type = 'type'
      limit @num - 1;

   set @num := 0;
   select @type := type, @num := count(*)
      from fruits
      group by type
      having count(*) > 1
      limit 1;

end while

That is pseudo-code, by the way; if you’re doing this in a stored procedure, you’re going to have to concatenate strings together to make an executable statement and execute it. If you’re using an external programming language, you’ll need to fetch the values that are duplicated and dynamically build a statement that deletes all but one row.

Summary

In this article I explained how to solve the special-case problem of removing duplicate rows with no distinguishing columns at all. It’s a harder case of the general problem, and SQL has no built-in way to solve it, so you have to learn your platform’s tricks to solve it. I showed you how to add a unique column so you can use the “easy” techniques I explained in an earlier article. You might also be able to put the rows into another table and drop the original table. Failing that, you have to use something like cursors. As a bonus, I explained two ways to do this in MySQL, one of them sneaky and the other not.

Further Reading:

Written by Xaprb

February 6th, 2007 at 9:03 am

Posted in Uncategorized

Tagged with , ,

17 Responses to 'How to delete duplicate rows with SQL, Part 2'

Subscribe to comments with RSS

  1. SQL Server Version:

    set nocount on
    create table #tbFruit (Fruit varchar(20))
    insert into #tbFruit (Fruit) values ('apple')
    insert into #tbFruit (Fruit) values ('apple')
    insert into #tbFruit (Fruit) values ('apple')
    insert into #tbFruit (Fruit) values ('apple')
    insert into #tbFruit (Fruit) values ('orange')
    insert into #tbFruit (Fruit) values ('orange')
    insert into #tbFruit (Fruit) values ('orange')
    
    create table #tbFruitsWithDupes (DupeID int identity, Fruit varchar(20), DupeCount int)
    
    insert into #tbFruitsWithDupes (Fruit, DupeCount)
    	select Fruit, count(*) DupeCount from #tbFruit
    	group by Fruit having count(*) > 1
    
    declare @x int, @max int, @Fruit varchar(20), @DupeCount int
    
    select @x = 1, @max	= max(DupeID) from #tbFruitsWithDupes
    
    while (@x <= @max)
    	begin
    		select @Fruit = Fruit, @DupeCount = DupeCount - 1
    			from #tbFruitsWithDupes Where DupeID = @x
    
    		set rowcount @DupeCount
    
    		delete from #tbFruit where Fruit = @Fruit
    
    		select @x = @x + 1
    	end
    
    select * from #tbFruit
    
    drop table #tbFruit
    drop table #tbFruitsWithDupes

    Bill Minton

    6 Feb 07 at 11:11 am

  2. Bill Minton: bravo! Thanks for writing this in. The SET ROWCOUNT is exactly the type of platform-specific trick I was talking about.

    Xaprb

    6 Feb 07 at 11:17 am

  3. Why not to add unique index to drop duplicate rows? It works only in MySQL.
    alter ignore table dupTest add unique index (a,b)

    Shantanu Oak

    7 Feb 07 at 8:08 am

  4. Thanks Shantanu, I didn’t know about that!

    Xaprb

    7 Feb 07 at 8:51 am

  5. I have a genius method for removing duplicates.. open enterprise manager.. list all rows and then go through deleting the rows that are the same using the delete button.

    OK, OK.. not so JENIUS after all.. worked for me though.

    Collin

    8 Feb 07 at 9:59 pm

  6. Hi, thanks for these nice articles.

    Maybe
    least(0, length(@type := type))) > 0
    instead of
    least(0, length(@type := type))) > 1

    I’ve tested with two columns (MySQL 5.0.27) and it works very well

    set @num := 0, @source := '', @target := '';
    delete from test
    where greatest(0,
       @num := if(source = @source and target = @target, @num + 1, 0),
       least(0, length(@source := source), length(@target := target))) > 0
    order by source;

    jimro

    2 Mar 07 at 6:40 pm

  7. Hi, sorry for my poor english.

    I forgot to format a part of my previous message in XHTML, thank you for correction.

    With this
    least(0, length(@type := type))) > 0
    only one row is preserved.
    Isn’t this the required goal?

    Regarding my previous code, it would be better like this:

    set @num := 0, @source := '', @target := '';
    delete from test
    where greatest(0,
       @num := if(source = @source and target = @target, @num + 1, 0),
       least(0, length(@source := source), length(@target := target))) > 0
    order by source, target;

    jimro

    3 Mar 07 at 8:26 am

  8. i want to delete the records on my table that has simllar entries in my column. e.g column A: has the following entries [aaaaaaa]
    [aaaaamobile]
    [aaaamotoraaaa]
    now i want to delete only the a’s

    Joseph

    17 Oct 07 at 6:34 am

  9. BINGO !!!!
    awesome article

    CHEERS!!

    Diwakar

    Diwakar

    12 Nov 07 at 9:36 am

  10. Great article

    Shamsh

    10 Oct 08 at 2:58 pm

  11. Here’s a nice ruby script that does the work on a two-column join table without unique ids. Select_all and execute are methods on ActiveRecord::Base.connection

    def self.duplicates(table, left, right)
    rows = select_all(“select distinct #{left} as id from #{table}”)
    rows.each do |row|
    dups = select_all(“select #{right} as rid, count(#{right}) as count from #{table} where #{left} = #{row['id']} group by #{right} having count > 1″)
    next if dups.empty?
    dups.each do |dup|
    puts “#{table} #{left} #{row['id']} #{right} #{dup['rid']}”
    execute “delete from #{table} where #{left} = #{row['id']} and #{right} = #{dup['rid']}”
    execute “insert into #{table} (#{left}, #{right}) values(#{row['id']}, #{dup['rid']})”
    end
    end
    nil
    end

    Jim

    31 Dec 08 at 12:10 am

  12. This looks much simpeler, and it works as well:

    delete mytable
    where rowid not in
    (select min(rowid) from mytable
    [where criterium] <-optional
    group by <-mention ALL fields
    );

    Edwin

    3 Apr 09 at 9:34 am

  13. It motivate me in a new way.
    But I wonder if most of you use mySQL instead SQL server.

    Itry

    30 Sep 10 at 9:06 pm

  14. If someone see, that will be fine

    Itry

    30 Sep 10 at 9:07 pm

  15. Thank you for the article, it was of considerable help. I had a table with many records but no primary key and I had to create a primary key on it. The problem was there were duplicates.

    Using the inner join query from your other article I was able to create a temporary table where I inserted the unique records found from all duplicates. Then I was able to delete all the duplicates from the original table, add the constraint, and then add the deleted records back in from the temporary table.

    Here is my code for those who might need this help later.

    {code}
    connect to TE_DFLT;

    set current schema EPOS;

    create table safeauditcopy (sitedirectoryid bigint not null,
    safeid vargraphic(32) not null,
    updatedatetime timestamp not null,
    debitamount decimal(19) not null,
    creditamount decimal(19) not null,
    aftertranssafeamount decimal(19) not null,
    businessdate timestamp not null,
    tlogtransactionid character(22) not null constraint XPKSafeAudit PRIMARY KEY
    );

    insert into safeauditcopy
    select sitedirectoryid,safeid,updatedatetime,debitamount,creditamount,aftertranssafeamount,businessdate,tlogtransactionid
    from (
    select bad_rows.*, rownumber() over(partition by bad_rows.tlogtransactionid) as id
    from safeaudit as bad_rows
    inner join (
    select tlogtransactionid
    from safeaudit
    group by tlogtransactionid
    having count(*) > 1
    ) as good_rows on good_rows.tlogtransactionid = bad_rows.tlogtransactionid
    ) where id > 1;

    delete from safeaudit where tlogtransactionid in (
    select tlogtransactionid
    from safeaudit
    group by tlogtransactionid
    having count(*) > 1
    );

    alter table safeaudit add constraint XPKSafeAudit primary key (TLOGTRANSACTIONID);

    insert into safeaudit select * from safeauditcopy;

    drop table safeauditcopy;

    {code}

    Lyndon

    7 Jun 11 at 5:13 pm

  16. I forgot to mention, the code was for DB2 and makes use of the OLAP function rownumber() to give me the ability to determine the difference between one record and another…

    Lyndon

    7 Jun 11 at 5:16 pm

  17. The code

    set @num := 0, @source := ”, @target := ”;
    delete from test
    where greatest(0,
    @num := if(source = @source and target = @target, @num + 1, 0),
    least(0, length(@source := source), length(@target := target))) > 0
    order by source, target;

    Does not work when the rows are not in constitutive order.

    Deepak

    14 Jul 11 at 4:45 am

Leave a Reply