Large-scale inserts: duplicate checking
-
I have 2 databases with over 7 million records, and about half are probably duplicates. I would like to merge these tables and remove the duplicates. The problem is that the only fields are name, address, and (sometimes) email address, so I have to do a duplicate check on name and address. The way I'm doing it now takes 5 minutes for every 1,000 records: I wrote a stored procedure that checks each row with an IF EXISTS statement and then inserts. I know there has to be a better way than that. Any help, or am I stuck waiting a couple of weeks for it to finish? By the way, the check is also driving my CPU to 85%.
-
The answer to your question depends on where the duplicates can exist. Are there duplicates inside each individual database, or only between the two databases? If it is the latter, you can use a strategy like this:
1) Copy one table into a new table.
2) Index the new table on Name and Address (to avoid repeated table scans). For speed, create the indexes after the copy.
3) Run an insert from the second table, qualified so that only the non-duplicates come across.
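Steps 1 and 2 might look something like the following sketch. The table and column names here are assumptions for illustration only; substitute your own:

```sql
-- Step 1: bulk-copy the first table into the merged table.
-- SELECT ... INTO creates NewTable without any indexes, so the copy is fast.
SELECT Name, Address, Email
INTO NewTable
FROM FirstTable;

-- Step 2: index the lookup columns AFTER the copy, so the bulk insert
-- is not slowed down by per-row index maintenance.
CREATE INDEX IX_NewTable_Name_Address ON NewTable (Name, Address);
```

Creating the index after the copy is the key to the speed: building one index over the finished table is far cheaper than updating it 7 million times.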
INSERT INTO [NewTable] (Name, Address, Email, etc...)
SELECT s.Name, s.Address, s.Email, s.etc...
FROM [2ndTable] s
LEFT JOIN [NewTable] n
  ON s.Name = n.Name AND s.Address = n.Address
WHERE n.Name IS NULL
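If duplicates also exist inside each individual table (the former case above), the same idea still works, but the initial copy in step 1 has to collapse them first. One hedged sketch, again with assumed names, is to GROUP BY the matching columns during the copy:

```sql
-- Collapse intra-table duplicates on (Name, Address) while copying.
-- MAX(Email) keeps one email per group, preferring a non-NULL value
-- when the duplicates disagree on it.
SELECT Name, Address, MAX(Email) AS Email
INTO NewTable
FROM FirstTable
GROUP BY Name, Address;
```

After this de-duplicated copy, the indexed LEFT JOIN insert from the second table proceeds exactly as shown above.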