How to delete duplicate rows with SQL, Part 2
By reader request, this article explains ways to remove duplicate rows when they are completely identical, and you don’t have a primary key or other criterion to identify which rows to “save.”
This is a special case of deleting duplicates. I’ve written another article about the more general case, so I assume you have the background it gives. If not, you should probably go read my article about how to delete duplicate rows in SQL.
Introduction
In general, this is a hard problem. Suppose you have the following data, and you want to delete everything but the first row of its type (you don’t care which, because all duplicate rows are completely identical).
| Fruit |
|---|
| Oranges |
| Oranges |
| Oranges |
| Apples |
| Apples |
| Apples |
| Apples |
| Apples |
When you’re done, you want just two rows in the table:
| Fruit |
|---|
| Oranges |
| Apples |
Why this is hard
This is hard because there is no way to do this in standard SQL (correct me if I’m wrong). SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set. That’s why SQL doesn’t give you tools to solve this problem.
No database product is truly relational, so in real life it’s possible for duplicates to occur. When it happens, you will have to resort to platform-specific methods to solve it. There should always be a way to do it, because there is always a difference between apparently identical rows. It might be an internal row ID, for example (as in Oracle). If nothing else, the rows have different memory and disk locations in the computer.
The easy way
The easiest thing to do is add a column with a unique number. This is called something different on every platform: it’s an IDENTITY column in SQL Server, an AUTO_INCREMENT column in MySQL, a SERIAL in PostgreSQL, and so on. Look at your platform’s documentation for instructions how to do it.
Once you’ve done that, you’re on easy street. Now go read my previous article to do the actual deleting.
If that won’t do…
Build a new table with distinct values from the old table, then drop and rename:
CREATE TABLE new_fruits ...; INSERT INTO new_fruits(fruit) SELECT DISTINCT fruit FROM fruits; DROP TABLE fruits; RENAME TABLE new_fruits fruits;
If you can’t do that…
Perhaps you simply can’t do either of the above. Maybe your table is too large, for example. In that case you’re going to have to use some sort of iterative technique to do it; loop through the rows one at a time and delete every row you see more than once. This is also going to be a platform-specific solution; you may need to use a WHILE loop or server-side cursor. Consult your platform’s documentation for more; I can’t possibly cover all the bases here.
Two examples for MySQL
Here’s a quick technique that uses advanced user-variable techniques on MySQL to delete the rows. MySQL’s server-side cursors are read-only, so some other technique has to be used. User-variables can do the trick, if you write the statement just right — it’s very touchy.
set @num := 0, @type := ''; delete from fruits where greatest(0, @num := if(type = @type, @num + 1, 0), least(0, length(@type := type))) > 1 order by type;
If you don’t understand that, go read the article :-) This can be very efficient because it doesn’t require any GROUP BY clause. If your rows are “naturally ordered” with all the duplicates adjacent to each other, you can even omit the ORDER BY clause (if your rows aren’t “sorted naturally,” you will miss some duplicate rows).
The other obvious option is to repeatedly identify a duplicated row, find how many times it’s duplicated, and delete one less than that many rows. You will need to either do this in a stored routine, or get help from some programming language. For example, in pseudo-code:
set @num := 0;
select @type := type, @num := count(*)
from fruits
group by type
having count(*) > 1
limit 1;
while @num > 0
delete from fruits where type = 'type'
limit @num - 1;
set @num := 0;
select @type := type, @num := count(*)
from fruits
group by type
having count(*) > 1
limit 1;
end while
That is pseudo-code, by the way; if you’re doing this in a stored procedure, you’re going to have to concatenate strings together to make an executable statement and execute it. If you’re using an external programming language, you’ll need to fetch the values that are duplicated and dynamically build a statement that deletes all but one row.
Summary
In this article I explained how to solve the special-case problem of removing duplicate rows with no distinguishing columns at all. It’s a harder case of the general problem, and SQL has no built-in way to solve it, so you have to learn your platform’s tricks to solve it. I showed you how to add a unique column so you can use the “easy” techniques I explained in an earlier article. You might also be able to put the rows into another table and drop the original table. Failing that, you have to use something like cursors. As a bonus, I explained two ways to do this in MySQL, one of them sneaky and the other not.



SQL Server Version:
set nocount on create table #tbFruit (Fruit varchar(20)) insert into #tbFruit (Fruit) values ('apple') insert into #tbFruit (Fruit) values ('apple') insert into #tbFruit (Fruit) values ('apple') insert into #tbFruit (Fruit) values ('apple') insert into #tbFruit (Fruit) values ('orange') insert into #tbFruit (Fruit) values ('orange') insert into #tbFruit (Fruit) values ('orange') create table #tbFruitsWithDupes (DupeID int identity, Fruit varchar(20), DupeCount int) insert into #tbFruitsWithDupes (Fruit, DupeCount) select Fruit, count(*) DupeCount from #tbFruit group by Fruit having count(*) > 1 declare @x int, @max int, @Fruit varchar(20), @DupeCount int select @x = 1, @max = max(DupeID) from #tbFruitsWithDupes while (@x <= @max) begin select @Fruit = Fruit, @DupeCount = DupeCount - 1 from #tbFruitsWithDupes Where DupeID = @x set rowcount @DupeCount delete from #tbFruit where Fruit = @Fruit select @x = @x + 1 end select * from #tbFruit drop table #tbFruit drop table #tbFruitsWithDupesBill Minton
6 Feb 07 at 11:11 am
Bill Minton: bravo! Thanks for writing this in. The
SET ROWCOUNTis exactly the type of platform-specific trick I was talking about.Xaprb
6 Feb 07 at 11:17 am
Why not to add unique index to drop duplicate rows? It works only in MySQL.
alter ignore table dupTest add unique index (a,b)
Shantanu Oak
7 Feb 07 at 8:08 am
Thanks Shantanu, I didn’t know about that!
Xaprb
7 Feb 07 at 8:51 am
I have a genius method for removing duplicates.. open enterprise manager.. list all rows and then go through deleting the rows that are the same using the delete button.
OK, OK.. not so JENIUS after all.. worked for me though.
Collin
8 Feb 07 at 9:59 pm
Hi, thanks for these nice articles.
Maybe
least(0, length(@type := type))) > 0instead of
least(0, length(@type := type))) > 1I’ve tested with two columns (MySQL 5.0.27) and it works very well
jimro
2 Mar 07 at 6:40 pm
Hi, sorry for my poor english.
I forgot to format a part of my previous message in XHTML, thank you for correction.
With this
least(0, length(@type := type))) > 0only one row is preserved.
Isn’t this the required goal?
Regarding my previous code, it would be better like this:
jimro
3 Mar 07 at 8:26 am
i want to delete the records on my table that has simllar entries in my column. e.g column A: has the following entries [aaaaaaa]
[aaaaamobile]
[aaaamotoraaaa]
now i want to delete only the a’s
Joseph
17 Oct 07 at 6:34 am
BINGO !!!!
awesome article
CHEERS!!
—
Diwakar
Diwakar
12 Nov 07 at 9:36 am
Great article
Shamsh
10 Oct 08 at 2:58 pm
Here’s a nice ruby script that does the work on a two-column join table without unique ids. Select_all and execute are methods on ActiveRecord::Base.connection
def self.duplicates(table, left, right)
rows = select_all(“select distinct #{left} as id from #{table}”)
rows.each do |row|
dups = select_all(“select #{right} as rid, count(#{right}) as count from #{table} where #{left} = #{row['id']} group by #{right} having count > 1″)
next if dups.empty?
dups.each do |dup|
puts “#{table} #{left} #{row['id']} #{right} #{dup['rid']}”
execute “delete from #{table} where #{left} = #{row['id']} and #{right} = #{dup['rid']}”
execute “insert into #{table} (#{left}, #{right}) values(#{row['id']}, #{dup['rid']})”
end
end
nil
end
Jim
31 Dec 08 at 12:10 am
This looks much simpeler, and it works as well:
delete mytable
where rowid not in
(select min(rowid) from mytable
[where criterium] <-optional
group by <-mention ALL fields
);
Edwin
3 Apr 09 at 9:34 am
It motivate me in a new way.
But I wonder if most of you use mySQL instead SQL server.
Itry
30 Sep 10 at 9:06 pm
If someone see, that will be fine
Itry
30 Sep 10 at 9:07 pm
Thank you for the article, it was of considerable help. I had a table with many records but no primary key and I had to create a primary key on it. The problem was there were duplicates.
Using the inner join query from your other article I was able to create a temporary table where I inserted the unique records found from all duplicates. Then I was able to delete all the duplicates from the original table, add the constraint, and then add the deleted records back in from the temporary table.
Here is my code for those who might need this help later.
{code}
connect to TE_DFLT;
set current schema EPOS;
create table safeauditcopy (sitedirectoryid bigint not null,
safeid vargraphic(32) not null,
updatedatetime timestamp not null,
debitamount decimal(19) not null,
creditamount decimal(19) not null,
aftertranssafeamount decimal(19) not null,
businessdate timestamp not null,
tlogtransactionid character(22) not null constraint XPKSafeAudit PRIMARY KEY
);
insert into safeauditcopy
select sitedirectoryid,safeid,updatedatetime,debitamount,creditamount,aftertranssafeamount,businessdate,tlogtransactionid
from (
select bad_rows.*, rownumber() over(partition by bad_rows.tlogtransactionid) as id
from safeaudit as bad_rows
inner join (
select tlogtransactionid
from safeaudit
group by tlogtransactionid
having count(*) > 1
) as good_rows on good_rows.tlogtransactionid = bad_rows.tlogtransactionid
) where id > 1;
delete from safeaudit where tlogtransactionid in (
select tlogtransactionid
from safeaudit
group by tlogtransactionid
having count(*) > 1
);
alter table safeaudit add constraint XPKSafeAudit primary key (TLOGTRANSACTIONID);
insert into safeaudit select * from safeauditcopy;
drop table safeauditcopy;
{code}
Lyndon
7 Jun 11 at 5:13 pm
I forgot to mention, the code was for DB2 and makes use of the OLAP function rownumber() to give me the ability to determine the difference between one record and another…
Lyndon
7 Jun 11 at 5:16 pm
The code
set @num := 0, @source := ”, @target := ”;
delete from test
where greatest(0,
@num := if(source = @source and target = @target, @num + 1, 0),
least(0, length(@source := source), length(@target := target))) > 0
order by source, target;
Does not work when the rows are not in constitutive order.
Deepak
14 Jul 11 at 4:45 am