Xaprb

Stay curious!

How to find duplicate rows with SQL

with 60 comments

This article shows how to find duplicated rows in a database table. This is a very common beginner question. The basic technique is straightforward. I’ll also show some variations, such as how to find “duplicates in two columns” (a recent question on the #mysql IRC channel).

How to find duplicated rows

The first step is to define what exactly makes a row a duplicate of another row. Most of the time this is easy: they have the same value in some column. I’ll take this as a working definition for this article, but you may need to alter the queries below if your notion of “duplicate” is more complicated.

For this article, I’ll use this sample data:

create table test(id int not null primary key, day date not null);

insert into test(id, day) values(1, '2006-10-08');
insert into test(id, day) values(2, '2006-10-08');
insert into test(id, day) values(3, '2006-10-09');

select * from test;
+----+------------+
| id | day        |
+----+------------+
|  1 | 2006-10-08 |
|  2 | 2006-10-08 |
|  3 | 2006-10-09 |
+----+------------+

The first two rows have the same value in the day column, so if I consider those to be duplicates, here’s a query to find them. The query uses a GROUP BY clause to put all the rows with the same day value into one “group” and then count the size of the group:

 select day, count(*) from test GROUP BY day;
+------------+----------+
| day        | count(*) |
+------------+----------+
| 2006-10-08 |        2 |
| 2006-10-09 |        1 |
+------------+----------+

The duplicated rows have a count greater than one. If you only want to see rows that are duplicated, you need to use a HAVING clause (not a WHERE clause), like this:

select day, count(*) from test group by day HAVING count(*) > 1;
+------------+----------+
| day        | count(*) |
+------------+----------+
| 2006-10-08 |        2 |
+------------+----------+

This is the basic technique: group by the column that contains duplicates, and show only those groups having more than one row.

Why can’t you use a WHERE clause?

A WHERE clause filters the rows before they are grouped together. A HAVING clause filters them after grouping. That’s why you can’t use a WHERE clause in the above query.

How to delete duplicate rows

A related question is how to delete the ‘duplicate’ rows once you find them. A common task when cleaning up bad data is to delete all but one of the duplicates, so you can put proper indexes and primary keys on the table, and prevent duplicates from getting into the table again.

Again, the first thing to do is make sure your definition is clear. Exactly which row do you want to keep? The ‘first’ one? The one with the largest value of some column? For this article, I’ll assume you want to keep the ‘first’ row — the one with the smallest value of the id column. That means you want to delete every other row.

Probably the easiest way to do this is with a temporary table. Especially in MySQL, there are some restrictions about selecting from a table and updating it in the same query. You can get around these, as I explain in my article How to select from an update target in MySQL, but I’ll just avoid these complications and use a temporary table.

The exact definition of the task is to delete every row that has a duplicate, except the row with the minimal value of id for that group. So you need to find not only the rows where there’s more than one in the group, you also need to find the row you want to keep. You can do that with the MIN() function. Here are some queries to create the temporary table and find the data you need to do the DELETE:

create temporary table to_delete (day date not null, min_id int not null);

insert into to_delete(day, min_id)
   select day, MIN(id) from test group by day having count(*) > 1;

select * from to_delete;
+------------+--------+
| day        | min_id |
+------------+--------+
| 2006-10-08 |      1 |
+------------+--------+

Now that you have this data, you can proceed to delete the ‘bad’ rows. There are many ways to do this, and some are better than others (see my article about many-to-one problems in SQL), but again I’ll avoid the finer points and just show you a standard syntax that ought to work in any RDBMS that supports subqueries:

delete from test
   where exists(
      select * from to_delete
      where to_delete.day = test.day and to_delete.min_id <> test.id
   )

If your RDBMS does not support subqueries, or if it’s more efficient, you may wish to do a multi-table delete. The syntax for this varies between systems, so you need to consult your system’s documentation. You may also need to do all of this in a transaction to avoid other users changing the data while you’re working, if that’s a concern.

How to find duplicates in multiple columns

Someone recently asked a question similar to this on the #mysql IRC channel:

I have a table with columns b and c that links two other tables b and c, and I want to find all rows that have duplicates in either b or c.

It was difficult to understand exactly what this meant, but after some conversation I grasped it: the person wanted to be able to put unique indexes on columns b and c separately.

It’s pretty easy to find rows with duplicate values in one or the other column, as I showed you above: just group by that column and count the group size. And it’s easy to find entire rows that are exact duplicates of other rows: just group by as many columns as you need. But it’s harder to identify rows that have either a duplicated b value or a duplicated c value. Take the following sample table, which is roughly what the person described:

create table a_b_c(
   a int not null primary key auto_increment,
   b int,
   c int
);

insert into a_b_c(b,c) values (1, 1);
insert into a_b_c(b,c) values (1, 2);
insert into a_b_c(b,c) values (1, 3);
insert into a_b_c(b,c) values (2, 1);
insert into a_b_c(b,c) values (2, 2);
insert into a_b_c(b,c) values (2, 3);
insert into a_b_c(b,c) values (3, 1);
insert into a_b_c(b,c) values (3, 2);
insert into a_b_c(b,c) values (3, 3);

Now, you can easily see there are some ‘duplicate’ rows in this table, but no two rows actually have the same tuple {b, c}. That’s why this is a bit more difficult to solve.

Queries that don’t work

If you group by two columns together, you’ll get various results depending on how you group and count. This is where the IRC user was getting stumped. Sometimes queries would find some duplicates but not others. Here are some of the things this person tried:

select b, c, count(*) from a_b_c
group by b, c
having count(distinct b > 1)
   or count(distinct c > 1);

This query returns every row in the table, with a COUNT(*) of 1, which seems to be wrong behavior, but it’s actually not. Why? Because the > 1 is inside the COUNT(). It’s pretty easy to miss, but this query is actually the same as

select b, c, count(*) from a_b_c
group by b, c
having count(1)
   or count(1);

Why? Because (b > 1) is a boolean expression. That’s not what you want at all. You want

select b, c, count(*) from a_b_c
group by b, c
having count(distinct b) > 1
   or count(distinct c) > 1;

This returns zero rows, of course, because there are no duplicate {b, c} tuples. The person tried many other combinations of HAVING clauses and ORs and ANDs, grouping by one column and counting the other, and so forth:

select b, count(*) from a_b_c group by b having count(distinct c) > 1;
+------+----------+
| b    | count(*) |
+------+----------+
|    1 |        3 |
|    2 |        3 |
|    3 |        3 |
+------+----------+

Nothing found all the duplicates, though. What I think made it most frustrating is that it partially worked, making the person think it was almost the right query… perhaps just another variation would get it…

In fact, it’s impossible to do with this type of simple GROUP BY query. Why is this? It’s because when you group by one column, you distribute like values of the other column across multiple groups. You can see this visually by ordering by those columns, which is what grouping does. First, order by column b and see how they are grouped:

When you order (group) by column b, the duplicate values in column c are distributed into different groups, so you can’t count them with COUNT(DISTINCT c) as the person was trying to do. Aggregate functions such as COUNT() only operate within a group, and have no access to rows that are placed in other groups. Similarly, when you order by c, the duplicate values in column b are distributed into different groups. It is not possible to make this query do what’s desired.

Some correct solutions

Probably the simplest solution is to find the duplicates for each column separately and UNION them together, like this:

select b as value, count(*) as cnt, 'b' as what_col
 from a_b_c group by b having count(*) > 1
 union
 select c as value, count(*) as cnt, 'c' as what_col
 from a_b_c group by c having count(*) > 1;
+-------+-----+----------+
| value | cnt | what_col |
+-------+-----+----------+
|     1 |   3 | b        |
|     2 |   3 | b        |
|     3 |   3 | b        |
|     1 |   3 | c        |
|     2 |   3 | c        |
|     3 |   3 | c        |
+-------+-----+----------+

The what_col column in the output indicates what column the duplicate value was found in. Another approach is to use subqueries:

select a, b, c from a_b_c
 where b in (select b from a_b_c group by b having count(*) > 1)
    or c in (select c from a_b_c group by c having count(*) > 1);
+----+------+------+
| a  | b    | c    |
+----+------+------+
|  7 |    1 |    1 |
|  8 |    1 |    2 |
|  9 |    1 |    3 |
| 10 |    2 |    1 |
| 11 |    2 |    2 |
| 12 |    2 |    3 |
| 13 |    3 |    1 |
| 14 |    3 |    2 |
| 15 |    3 |    3 |
+----+------+------+

This is probably much less efficient than the UNION approach, and will show every duplicated row, not just the values that are duplicated. Still another approach is to do self-joins against grouped subqueries in the FROM clause. This is more complicated to write correctly, but might be necessary for some complex data, or for efficiency:

select a, a_b_c.b, a_b_c.c
from a_b_c
   left outer join (
      select b from a_b_c group by b having count(*) > 1
   ) as b on a_b_c.b = b.b
   left outer join (
      select c from a_b_c group by c having count(*) > 1
   ) as c on a_b_c.c = c.c
where b.b is not null or c.c is not null

Any of these queries will do, and I’m sure there are other ways too. If you can use UNION, it’s probably the easiest.

Written by Xaprb

October 9th, 2006 at 10:56 pm

Posted in SQL

60 Responses to 'How to find duplicate rows with SQL'

Subscribe to comments with RSS or TrackBack to 'How to find duplicate rows with SQL'.

  1. Why can’t you use a WHERE clause?

    As a matter of fact we can as: – SQL is an extremely redundant language – HAVING clause is one of the redundancies

    select day
         , min(id)
         , count(*)
      from test
     where 1 < (select count(*)
                  from test t_dup
                 where t_dup.day = test.day
               )
     group by day
    

    Dmitri Mikhailov

    10 Oct 06 at 1:11 pm

  2. Well spoken! Still, it’s not quite the same thing. You wrote a correlated subquery. I was speaking about why you can’t use a WHERE clause like this:

    select ... WHERE COUNT(*) > 1

    This is a common thing beginners try before they learn about the HAVING clause. I just wanted to address that question along the way.

    Xaprb

    10 Oct 06 at 1:28 pm

  3. It’s possible to remove duplicates from table test with just one query, use:

    select * from test group by day HAVING count(distinct id > 1)

    balluche

    10 Oct 06 at 5:26 pm

  4. That query doesn’t do what you think — run it and see. You are right though, it’s possible to remove duplicates with a single query. I meant to keep this very basic for beginners, but maybe it’s a good idea for me to fill in some techniques about single-query techniques. For one thing, they eliminate the need for a transaction.

    Xaprb

    10 Oct 06 at 10:08 pm

  5. Excellent, just excellent. Good explanations and examples.

    Thank you.

    Matt Bunter

    11 Oct 06 at 9:15 am

  6. Well, I may not understand but for example with this table

    1	2006-10-08
    2	2006-10-08
    3	2006-10-09
    4	2006-10-08
    5	2006-10-09
    6	2006-10-10
    7	2006-10-11

    Result of the query above is

    1	2006-10-08
    3	2006-10-09
    6	2006-10-10
    7	2006-10-11

    Distinct is not necessary. I may be wrong but I would like to know how you remove duplicates with one query. Thanks !

    balluche

    12 Oct 06 at 4:42 pm

  7. balluce, my next article is about how to remove duplicates (I wrote it since you first commented). I hope that helps.

    About your query, there are two things happening that aren’t the way they seem. One is that you are selecting non-GROUPed columns in a GROUPed statement. This is something MySQL allows you to do, but it is technically not correct. It confuses people a lot. If you know how MySQL works, you can use it, but it is still a bad idea in my opinion. You should definitely not ever use it unless you have a firm grasp on the mathematics of why it is technically invalid.

    The other thing is your query is using the COUNT() function incorrectly. COUNT() counts defined values, not “true” values. COUNT(id > 1) is really the same as “count the number of times the expression ‘a row has an id greater than one’ has a defined value,” not “count the number of times id is greater than one.” The result of the boolean expression “this id value is greater than 1″ is always a defined value unless id is NULL, so even when id is not greater than one, it still contributes to the count. To make this clearer, try this query:

    select count(0 > 1) from test;

    The result is not what you expected, is it? You need to put the “greater than one” outside the COUNT() function.

    This is why your query returns a row for 2006-10-10 even though that date appears only once in the table, and is hence not a duplicate.

    Xaprb

    12 Oct 06 at 5:11 pm

  8. Thank you for your response. You’re right, the HAVING CLAUSE is not necessary obviously. Therefore my query doesn’t remove duplicate but display non duplicate lines. It’s possible then to save the result in a temporary table and to upload data to the original table.

    More portable query is:


    select min(id), day from test group by day

    balluche

    13 Oct 06 at 12:39 pm

  9. I have a complicated SQL query which puts the following information into a temporary table which I then sort and return to display on the web (intranet):

    TrackingID, DetailID, Requester, ServiceDirector, MovedDate, Location, …

    Sometimes there are duplicates – same TrackingID & DetailID (same TrackingID and unique DetailID is not a duplicate). In fact the “duplicate” only differs by ServiceDirector. I want the “last” ServiceDirector – not an order by ServiceDirector but the last displayed in the query.

    Any ideas on how to return the record with the “last” ServiceDirector from the stord procedrue?

    Lynn MacFarland

    3 Nov 06 at 8:14 pm

  10. I have a draft of another post about this, but you can do a self-join:

    select tbl.* from tbl
       inner join (
          select TrackingID, DetailID, max(ServiceDirector) as ServiceDirector
          from tbl group by TrackingID, DetailID
       ) as x on x.TrackingID = tbl.TrackingID
          and x.DetailID = tbl.DetailID
          and x.ServiceDirector = tbl.ServiceDirector

    Basically, group the subquery by whatever is unique in the table, and then join that result against the full table on the unique columns plus the max() or min().

    You can also do a correlated subquery, but test the performance carefully — it may be very bad!

    select * from tbl
    where ServiceDirector = (
       select max(ServiceDirector)
       from tbl as tbl2
       where tbl.TrackingID = tbl2.TrackingID
          and tbl.DetailID = tbl2.DetailID)

    Xaprb

    3 Nov 06 at 9:17 pm

  11. Hi there, I have a similar issue but slightly complicated. I have 3 tables.

    select * from Communications;
    +-----------------+----------+--------------+-------------+
    | CommunicationID | EntityID | EntityTypeID | Detail      |
    +-----------------+----------+--------------+-------------+
    |               3 |       11 |            8 | abc@abc.com |
    |               4 |        2 |           13 | abc@abc.com |
    |               5 |       12 |            8 | xyz@abc.com |
    |               6 |        3 |           13 | abc@abc.com |
    +-----------------+----------+--------------+-------------+
    
    select * from Sites;
    +--------+------------+
    | SiteID | SupplierID |
    +--------+------------+
    |      2 |         11 |
    |      3 |         12 |
    +--------+------------+
    
    select * from Suppliers;
    +------------+--------------+
    | SupplierID | SupplierName |
    +------------+--------------+
    |         11 | Jims         |
    |         12 | XYZ          |
    +------------+--------------+

    Sites belong to a supplier. Communications Table keeps the Communication details of both the Sites and the suppliers. EntityID is the ID of either the supplier or the Site, which is determined by the EntityTypeID. Entity type ID 8 = Supplier and 13 = Site.

    As you can see that communication table can hold both the sites and suppliers communication details, and in some cases the supplier details and the sites details in table communication detail column can be the same as shown above (abc@abc.com) thus duplicating the value. I want to get a list of the duplicate values and then delete them. Duplicate values are only those where the supplier owns the site, meaning if the Site is not of the same supplier as shown in Site table (which there is 11) than even if the values are same its not duplicating.

    To put in Simple Terms, Supplier 11 has Communication Detail abc@abc.com in the Communication table.
    and owns Site 2 which also has Detail in Communications Table abc@abc.com. ( the Detail must be same if its different then its ok eg if the Detail of Site 2 is def@abc.com then its fine)

    So this is duplication which i wana list out and remove. And if Site 3 has the same Detail in the Communications Table abc@abc.com, and is owned by Supplier 12 its not duplication. as the supplier is different. So Communications Record 4 is not duplication but record 1 and 2 has duplicate details as the Supplier and Sites are related.

    Please help me if you understand it and if its not clear let me know.

    Regards
    Nom

    Nom

    9 Nov 06 at 1:01 pm

  12. Nom, if I’m understanding you correctly, the following query makes it obvious which entries are duplicated:

    select Detail, coalesce(SupplierID, EntityID) as SupplierID
    from Communications
        left outer join Sites on EntityID = SiteID and EntityTypeID = 13
    
    +-------------+------------+
    | Detail      | SupplierID |
    +-------------+------------+
    | abc@abc.com |         11 |
    | abc@abc.com |         11 |
    | xyz@abc.com |         12 |
    | abc@abc.com |         12 |
    +-------------+------------+

    You can apply the techniques I discussed to this result set. For example:

    select min(CommunicationID) as MinID,
          Detail, coalesce(SupplierID, EntityID) as SupplierID
    from Communications
        left outer join Sites on EntityID = SiteID and EntityTypeID = 13
    group by Detail, coalesce(SupplierID, EntityID)
    having count(*) > 1;
    
    +-------+-------------+------------+
    | MinID | Detail      | SupplierID |
    +-------+-------------+------------+
    |     3 | abc@abc.com |         11 |
    +-------+-------------+------------+

    I believe denormalized schema is causing your problems here. Any time you have a column that means more than one thing depending on the value of another column, you may have a troublesome design. You should probably redesign these tables. MySQL’s site has one of the best introductions to normalization I have seen.

    Xaprb

    12 Nov 06 at 11:13 pm

  13. Thanks. found what I was looking for in your example. You saved my life today.

    Anonymous

    14 Nov 06 at 11:43 am

  14. SELECT *
    FROM TABLE tbl1
    WHERE EXISTS (
      SELECT
        NULL
      FROM
        TABLE tbl2
      WHERE
        tbl2.ID1= tbl1.ID1
        AND tbl2.ID2= tbl1.ID2
      GROUP BY
        tbl2.ID1, tbl2.ID2
      HAVING
        tbl1.ID < MAX(tbl2.ID)
    )

    Brandon

    14 Nov 06 at 7:03 pm

  15. Here is another approach:

    select distinct all the duplicate rows into another table,
    delete the rows from the original table,
    insert the rows from the distinct rows table back into the original table.

    Brian Ganly

    14 Dec 06 at 5:54 am

  16. Thanks saved me a lot of time

    Clift Maples

    15 Dec 06 at 11:00 am

  17. Hi there, i got some problem here and i hope i got your attention. I’ve got two table, attendance and subject table.

    [ed: snipped tables out and re-worded some things for clarity]

    i want to compare both table at ‘metric_no’.
    And i want only those not duplicate result to show.

    basically what i want to accomplish here is to compare the ‘attendance’ table, data from date_range1 to date_range2 against table ’subject_table’.

    So, if there is duplicate, i want to remove it (just don’t show it, not completely remove from database).
    If there is another way to archive this please teach me.

    khalil

    24 Dec 06 at 12:06 pm

  18. Hi hkalil, your problem is about selecting non-matching rows, not removing duplicate rows. I use an exclusion join for this.

    Xaprb

    24 Dec 06 at 3:33 pm

  19. Hi, I am having a little trouble understanding the query to find the duplicate entries. i have 1 table called Details with column headings ID and Date. I need to find when the ID and corresponding date are the same: for example,

    ID        date
    2      2005-01-01
    2      2005-01-01

    So i need to see the ID and date when they are exactly the same. Im trying to find errors where a transaction goes thru twice on accident. PLease help thanks

    mwq27

    10 Jan 07 at 7:01 pm

  20. Hi mwq27, in your case you should define “duplicate” as “has the same pair of values in (ID, date)” (see the first sub-section in the article). Then you GROUP BY those columns together, and add a HAVING clause to see only rows that are repeated. For example:

    select ID, date, count(*)
    from yourtable
    GROUP BY ID, date
    HAVING count(*) > 1

    I hope that helps!

    Xaprb

    11 Jan 07 at 9:25 am

  21. hi ,

    I have doubt imagine a scenario where my table doesn’t have a Primary key in that case how do I:

    1. Display only the duplicate records

    2. Delete these records only.

    pk340

    19 Jan 07 at 10:25 am

  22. Lack of a primary key ALWAYS makes it harder. You can display the duplicates easily with the instructions in this post. To delete them, follow the instructions in the next post (there is a link on this page) but you will have to use a LIMIT clause to avoid deleting every row that’s duplicated (I assume you only want to delete some of them, so you keep your data but just delete the duplicate rows).

    Xaprb

    19 Jan 07 at 3:24 pm

  23. My last example had a lot of assumed knowledge. Here is a worked example from a real life situation.

    \d system_size
                      Table "system_size"
     Attribute  |     Type      | Modifier | Default Value
    ------------+---------------+----------+---------------
     date_stamp | date          | not null |
     db_name    | character(32) | not null |
     table_name | character(32) | not null |
     byte_size  | numeric(18,0) | not null |

    nb pk should be first 3 columns but DB does not enforce pk constraints

    --Find days with duplicates, e.g. days which have about double the number of rows of other days.
    SELECT DATE_STAMP,
           COUNT(* )
    FROM     SYSTEM_SIZE
    GROUP BY DATE_STAMP
    ORDER BY 1;
    
    --Create backup
    CREATE TABLE SYSTEM_SIZE_BAK
     AS
      SELECT *
      FROM   SYSTEM_SIZE;
    
    --Take out rows of duplicate day into temp table for days with duplicates
    CREATE TABLE BB1
     AS
      SELECT DISTINCT *
      FROM   SYSTEM_SIZE
      WHERE  DATE_STAMP IN ('2007-02-05','2007-02-06');
    
    --Find duplicates based on logical pk, e.g same pk different bytes values
    SELECT *
    FROM   BB1
    WHERE  (DATE_STAMP,DB_NAME,TABLE_NAME) IN (SELECT   DATE_STAMP,
                                                        DB_NAME,
                                                        TABLE_NAME
                                               FROM     BB1
                                               GROUP BY DATE_STAMP,DB_NAME,TABLE_NAME
                                               HAVING   COUNT(* ) > 1);
    
    --Show rows rows to delete
    SELECT *
    FROM     BB1
    WHERE    (DATE_STAMP,DB_NAME,TABLE_NAME,BYTE_SIZE) IN (SELECT   DATE_STAMP,
                                                                    DB_NAME,
                                                                    TABLE_NAME,
                                                                    MIN(BYTE_SIZE)
                                                           FROM     BB1
                                                           GROUP BY DATE_STAMP,DB_NAME,TABLE_NAME
                                                           HAVING   COUNT(* ) > 1)
    ORDER BY 1,2,3;
    
    --delete duplicate rows, if byte_size different made decision to keep the highest byte value
    DELETE FROM BB1
    WHERE       (DATE_STAMP,DB_NAME,TABLE_NAME,BYTE_SIZE) IN (SELECT   DATE_STAMP,
                                                                       DB_NAME,
                                                                       TABLE_NAME,
                                                                       MIN(BYTE_SIZE)
                                                              FROM     BB1
                                                              GROUP BY DATE_STAMP,DB_NAME,TABLE_NAME
                                                              HAVING   COUNT(* ) > 1)
    
    --check all is there
    SELECT DATE_STAMP,
           COUNT(* )
    FROM     BB1
    GROUP BY DATE_STAMP
    ORDER BY 1;
    
    --Truncate original table and insert back rows that should be there from backup and deduped table
    TRUNCATE TABLE SYSTEM_SIZE;
    
    INSERT INTO SYSTEM_SIZE
    SELECT *
    FROM   SYSTEM_SIZE_BAK
    WHERE  DATE_STAMP NOT IN ('2007-02-05','2007-02-06');
    
    INSERT INTO SYSTEM_SIZE
    SELECT *
    FROM   BB1;
    
    --Find days with duplicates, check again
    SELECT DATE_STAMP,
           COUNT(* )
    FROM     SYSTEM_SIZE
    GROUP BY DATE_STAMP
    ORDER BY 1;

    Brian Ganly

    27 Feb 07 at 12:28 pm

  24. Nice Tutorial

    Kunal

    1 Mar 07 at 2:36 am

  25. I have a table with 3 columns [k1(primary key),k2,k3).
    There are some rows that although they have different value in k1 they have identical k2,k3 columns with some other rows above.I want to delete these duplicate rows.
    Using (select distinct k2,k3) i get the rows that i want to keep but i cant find a way to delete the remaining rows.
    Any help is welcome
    thanks

    LamKaran

    4 Mar 07 at 7:19 pm

  26. Please see my later posts on deleting duplicate rows.

    Xaprb

    4 Mar 07 at 8:39 pm

  27. Very useful. Thanks very much for sharing your knowledge.

    Matt

    6 Mar 07 at 4:02 pm

  28. It helps me lot. Thanks.

    Devkant

    14 Mar 07 at 2:15 am

  29. Xaprb…I have a question..using your above example of finding the duplicate rows showing the “day” and the “count”….

    select day, count(*) from test group by day HAVING count(*) < 1;
    +------------+----------+
    | day        | count(*) |
    +------------+----------+
    | 2006-10-08 |        2 |
    +------------+----------+

    Is there a way to show all the fields that are duplicates just like below? I want the ID, day field to show…is it possible ??

    +----+------------+
    | id | day        |
    +----+------------+
    |  1 | 2006-10-08 |
    |  2 | 2006-10-08 |
    +----+------------+

    Ekin

    16 Mar 07 at 5:18 pm

  30. I think the queries I showed in my post on first/least/max row per group will help you.

    Xaprb

    17 Mar 07 at 8:21 am

  31. Hi

    Hopefully someone can help with this little problem I’m having. I’m removing duplicate rows from a table by grouping five VARCHAR columns together and then using count(*) > 1. This is working fine except for when there are NULL values involved.

    For example, if column C is NULL in rows 2 and 3 and columns A, B, D and E are contain identical strings, I want this grouping of values to be marked as a duplicate. But this is not happening. MYSQL doesn’t seem to be seeing that NULL is the same as NULL. If I change the NULL values in rows 2 and 3 to a space instead, the process works fine and the grouping is returned as a duplicate.

    Hopefully that makes sense. Thank you for any help.

    Martin

    13 Apr 07 at 10:05 am

  32. Your trouble probably stems from misunderstanding NULL comparisons, which also affect COUNT(). I wrote an article about why NULL never compares false to anything in SQL. That might help you understand what is happening.

    Xaprb

    13 Apr 07 at 10:09 am

  33. Hi

    i got a delivery events table, where parcels were failed or delivered and other event logs, the problem i have is a parcel could be delivered twice on the same day but different times, this is due to the driver input on his hand held machine, or other times where the system does this. i have a table where the columns for time and date are different and just what the records of parcels that had the first log on the same day and not the second log. any help would be great

    cheers suds

    sud desai

    16 Apr 07 at 11:44 am

  34. [...] For Reference please see this link : Duplicate Row Delete [...]

  35. This has been extremely helpful. Thank you for taking the time to explain this in such detail.

    Joe Schmoe

    2 May 07 at 11:26 am

  36. I really enjoyed this page, I found it looking for the answer to this question:
    “Why do you have to use a HAVING clause, why can’t you use a WHERE clause?”

    I think I understand your answer … the order is important, but what if you put the where clause before the group by clause, in the correct order? For example, this would seem to be logically correct:

    Select state, count(state) from customers
    where count(state) > 1
    group by state;

    But it doesn’t work. However, this does:

    Select state, count(state) from customers
    group by state
    having count(state) > 1;

    Why?

    Scott

    3 May 07 at 11:26 am

  37. Hi Scott,

    It works because the clauses are applied at different points. WHERE filters rows before they are grouped, HAVING filters them after. COUNT() is applied over a group, so there is no COUNT() before the rows are grouped. That’s why you can’t refer to a grouped expression in the WHERE clause — the rows haven’t been grouped yet.

    Xaprb

    3 May 07 at 5:20 pm

  38. Thanks for writing this blog post!

    David Eriksson

    22 May 07 at 9:15 am

  39. [...] is fairly hardcore, if the identify and delete approach is used – an approach described in detail here. An other approach to the problem of duplicates is to simply add a UNIQUE index to the current [...]

  40. Hi this is a very useful blog. Thanks for posting. I have a question about developing a query to identify the duplicate records. I have built a view with the following fields: Site ID, Site Name, Address Line 1, City, State, Industry. Records are duplicates if more than one Site Name and City are exactly the same.

    Using the info in other posts I can find the duplicates if i only select the Site Name and City in for the report. But I also have to include all other fields as well (Site ID, Site Name, Address Line 1, City, State, Industry).

    How can use the group by and have all the fields in my report? Thanks in advance for your help.

    axt

    3 Aug 07 at 7:59 pm

  41. You need to create a set of duplicate Site_Name and City values, and then join that back to the entire data set. The easiest way is with a subquery in the FROM clause:

    select whole_thing.*
    from whole_thing
    inner join (
       ... your duplicate-finding query here ...
    ) as dupes using(Site_Name, City);

    Xaprb

    3 Aug 07 at 8:03 pm

  42. Thanks Xaprb, you’re a genius. I couldn’t figure out the syntax for the inner join duplicate finding query in SQL Server so I created a new view for the duplicate finding query and the did the inner join and it works like a charm.

    axt

    7 Aug 07 at 3:05 pm

  43. hi.. This is very useful blog.. i findout very useful query.. i need query for finding out duplicate rows from multiple tables…

    Imthiyas

    11 Sep 07 at 6:41 am

  44. Hi, I foound this is a very nice site for sql. I have a small query. In my table there are three fields CustNo, IdNo, Cust Desc. There can be any number IdNo for the CustNo. But the IdNo assigned to one customer can not be assigned to another. Now which is the best query to check if there are any IdNo common to more than one customer. Here CustNo can be repreated and IdNo Can be also be repeated any number of times in the table.. Thanks..

    nan

    22 Sep 07 at 1:50 am

  45. Looks like you found the right place! It’s all on this page.

    Xaprb

    22 Sep 07 at 7:50 am

  46. Hi all sql guru, please help me with the following:

    Based from the data below, I need to get all the LAST transaction per compound keys(SPID NPA NXX Line) and status != ‘Delete’

    Date SPID NPA NXX Line Status
    9-23-2007 08088 204 987 9531 Delete
    9-23-2007 08088 204 987 9532 Input
    9-23-2007 08088 204 987 9533 Delete
    9-23-2007 08088 204 987 9533 Add
    9-23-2007 08089 204 977 2954 Add
    9-23-2007 08089 204 977 2954 Delete
    9-23-2007 08089 204 977 2954 Add
    9-23-2007 08090 204 977 2956 Add
    9-23-2007 08090 204 977 2956 Delete

    Target Result:
    9-23-2007 08088 204 987 9532 Input
    9-23-2007 08088 204 987 9533 Add
    9-23-2007 08089 204 977 2954 Add

    smith

    24 Sep 07 at 7:25 am

  47. I am still confused . My table is like this ….

    IdNo CustId CustDesc
    1 C1 Customer1
    2 C1 Customer1
    3 C1 Customer1
    4 C2 Customer2
    5 C2 Customer2
    6 C2 Customer2
    2 C2 Customer2
    3 C3 CuStomer3
    7 C3 Customer3

    In the above example given I would like to see all the IdNos with different customers. My result will be IdNo’s 2 and 3. Thanks

    nan

    25 Sep 07 at 2:16 am

  48. I have 1 table I need to find all duplicates from 3 columns (address1 city state) so I created a view and concatenated those 3 columns into 1 new called address3. Each time I run a simple select distinct address3 query, I get a different result. Is there a better way to find and delete these duplicates? Thank you.

    Nancy

    5 Oct 07 at 10:36 am

  49. Thanks its a great article……..

    i used this query

    ” select b, c, count(*) from a_b_c group by b, c having count(1) or count(1); “

    it worked well

    but my the problem is that its also showing the null values too as duplicates ….. IS there any way to avoide this…….?

    thanks

    Moon

    17 Oct 07 at 11:49 pm

  50. A WHERE clause should fix that. Remember col = NULL is always false.

    Xaprb

    18 Oct 07 at 6:23 am

  51. Xaprb………….thanks 4 the reply…….

    can u give an example ……….

    thanks

    moon

    22 Oct 07 at 2:51 am

  52. A helpful way to get the index without too much trouble is to use MAX()…

    select max(article_id) from articles group by headline HAVING count(*) > 1

    Donovan

    15 Nov 07 at 8:14 pm

  53. Does anyone want to help me?

    I’m new to SQL, an absolute beginner. I can do what I want to do in basic, but can’t access the databases that way!

    Anyway, I want to isolate the dublicates across two tables.
    The entries all have a unique ID, but I need a combination of the two to get what I need (All transactions, I need both ins and outs if you will!) Whatever it is I’ve done, every transaction is repeating 16 times (Why 16? – I have no idea either!) So if anyone reading this knows how to do the above but across two tables, please let me know and I’ll be your friend for life!

    Bebop

    27 Mar 08 at 7:03 am

  54. I had a table of about 800 rows with a small number of accidental partial duplicates (most fields the same, one field different), and I wanted to see the duplicates myself before choosing one to delete manually.

    A similar query might be useful in e.g. the situation where someone’s accidentally put in one or more slightly-wrong phone numbers, when the right ones were already in the table. You don’t want to just randomly delete one of the entries, because they’re not 100% identical – you want to check first which one is correct.

    In my case, the problem arose from a combination of user input error and an automatic numbering system, so that a few entries had somehow acquired duplicates in everything but the auto-number. (In retrospect I could have told MySQL to go ahead and delete the duplicates – I was just being on the safe side to make sure I understood what was going on.)

    It took me a little while to figure out a query that would work for my purpose, but here is what seemed to do the trick:

    SELECT * FROM table_name t1, table_name t2 WHERE (t1.apple=t2.apple AND t1.banana=t2.banana AND t1.cherry=t2.cherry AND t1.differing_one!=t2.differing_one)

    Just thought I’d copy it here in case it was useful to someone.

    Jennifer

    4 Feb 09 at 5:54 am

  55. Quote:
    More portable query is:
    select min(id), day from test group by day

    Heres how i used it for my problem in MS Access!!

    SELECT userid, username
    FROM usertable
    WHERE id in (
    SELECT min(id)
    FROM usertable
    GROUP BY userid
    );

    Kang

    11 Mar 09 at 4:34 am

  56. Quite helpful topic to be honest, so thank you!

    For those who want to get the related rows (duplicate rows related to the mother row lets say) may find this code useful…:

    select test.minID, test.ID from test
    inner join (
    select day, min(ID) as minID
    from test group by day HAVING count(*) > 1
    ) as test2 on test2 .day= test.day and test2 .minIDtest.ID –ignore self linking
    order by minID

    Basically we use the same table to create an ID to ID relation of the “primary” records that are the unique ones with their respective duplicate rows…(it was an exerciseI was given recently..)

    AlbDev

    17 Mar 09 at 6:47 am

  57. there is “” missing in the join clause …
    so it should be:

    select test.minID, test.ID from test
    inner join (
    select day, min(ID) as minID
    from test group by day HAVING count(*) > 1
    ) as test2 on test2 .day= test.day and test2 .minID test.ID –ignore self linking
    order by minID

    AlbDev

    17 Mar 09 at 6:50 am

  58. it is not working is it???
    basically test2.minID should be different from test.ID

    Looks like the TextArea is ignoring the different (not equal) operator …

    cheers

    AlbDev

    17 Mar 09 at 6:52 am

  59. I have a rather more challenging and kind of weird requirement (but it’s a valid business requirement):(

    I have already spent unfruitful two days two find a query to eliminate duplicates. The queries given in your article are not working (or I could not tweak them to work:( ) and frankly I am not really an expert:(

    Here is the table structure:
    MYTABLE (id int primary field, firstname varchar, pid int, tel_number int)

    Sample data:
    ——————————–
    ID firstname pid tel_number
    ——————————–
    1 ‘Nitin’ 10 5203090
    2 ‘Nitin’ 11 5203090
    3 ‘Patil’ 10 5203091
    4 ‘Patil’ 11 5203095
    5 ‘Titli’ 15 5203099
    ——————————–

    I want a query which returns following:

    1 ‘Nitin’ 10 5203090
    3 ‘Patil’ 10 5203091
    5 ‘Titli’ 15 5203099

    i.e. the duplicates should be based on FirstName field. The filter to be used for firstname is like ‘%ti%’. The query should not care about ID, pid, tel_number fields. It should return the record for first (or any) occurrence of FirstName.

    I even don’t know whether this is possible using single SQL statement.
    I don’t want to use any stored procedure. The table structure is fixed and can’t be altered. The data from all the fields is necessary (it may be any one row in case of duplicate firstname).
    What do you think about this?

    Nitin

    15 Jun 09 at 1:57 pm

  60. [...] dirla tutta questo post deve molto a un post analogo in inglese [↩] Tag: lavoro, [...]

Leave a Reply