Siafoo is here to make coding less frustrating and to save you time. Join Siafoo Now or Learn More

Removing duplicates from a PostgreSQL database Atom Feed 0

Your intern imported the same data multiple times, and now you have duplicate textual content in your database? Here's one way to remove it.

The basic assumptions are that you have a table, named foo, that contains a text field, called contents and that the contents field may contain duplicate rows.

For my purposes I am using a table with 30,000 rows of which 8,000 rows must be removed because they duplicate content one or more times. The contents field contains an average of only 180 characters making the hash function very fast but also probably unnecessary.

  1. First, as always, backup your database.

  2. Add an md5 hash key and populate it with the content hashes (~2 seconds for my data).

    # 's
    1ALTER TABLE foo ADD COLUMN hash text;
    2UPDATE foo SET hash = md5(content);
  3. (optional) Create a unique id (i.e. like a primary key) for each row, if one does not exist.

    # 's
    1CREATE SEQUENCE foo_sequence
    2 INCREMENT 1 MINVALUE 1 MAXVALUE 9223372036854775807 START 1 CACHE 1;
    3-- Set the correct owner for the sequence if needed
    4-- ALTER TABLE foo_seq OWNER TO postgres;
    6-- Populate the field
    7ALTER TABLE foo ADD COLUMN id integer;
    8UPDATE foo SET id = nextval('foo_seq');
  4. (optional) Estimate the number of duplicates

    You can use the query below to estimate the number of duplicates, so that once the deletion takes place you can figure out if something has gone horribly wrong™.

    # 's
    1SELECT sum(dupes.c) - count(dupes.c)
    2FROM (SELECT count(*) AS c
    3 FROM foo
    4 GROUP BY hash
    5 HAVING count(*) > 1
    6 ) AS dupes
  5. Delete the duplicates (~3 minutes on my local machine, a dual core AMD)

    # 's
    1DELETE FROM foo
    2 WHERE id NOT IN (SELECT min(id)
    3 FROM foo
    4 GROUP BY hash HAVING count(*) >= 1)
  6. Drop the hash column

    # 's

Naturally you can modify the above procedure to better fit your data definition and uniqueness requirements. It is also probable that you can achieve some speed ups by using indexes... somewhere.

As a side note this is the actual code I used to remove the duplicate fortunes from the siafoo database (except for step 3).

Feel free to add better duplicate removal methods to this article.