Skip to content
  • Categories
  • Recent
  • Tags
  • Popular
  • World
  • Users
  • Groups
Skins
  • Light
  • Cerulean
  • Cosmo
  • Flatly
  • Journal
  • Litera
  • Lumen
  • Lux
  • Materia
  • Minty
  • Morph
  • Pulse
  • Sandstone
  • Simplex
  • Sketchy
  • Spacelab
  • United
  • Yeti
  • Zephyr
  • Dark
  • Cyborg
  • Darkly
  • Quartz
  • Slate
  • Solar
  • Superhero
  • Vapor

  • Default (No Skin)
  • No Skin
Collapse
Code Project
  1. Home
  2. Database & SysAdmin
  3. Database
  4. large scale inserts duplicate checking

large scale inserts duplicate checking

Scheduled Pinned Locked Moved Database
helpdatabase
2 Posts 2 Posters 0 Views 1 Watching
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • B Offline
    B Offline
    brsecu
    wrote on last edited by
    #1

    I have 2 databases with over 7 million records where about half are probably duplicated. I would like to merge these tables and take out the duplicates. The problem comes in where the the only fields are name address and email address(sometimes). So i have to do a dup check on the name and address. The way im doing it is taking 5 min for every 1000 records. I tried writing a stored procedure that checked each one with a if exists statement then insert. I know there has to be a better way than that. ANy help or am i stuck waiting a couple weeks to be done. By the way the cpu i'm using is also using 85% cpu to do the checking.

    M 1 Reply Last reply
    0
    • B brsecu

      I have 2 databases with over 7 million records where about half are probably duplicated. I would like to merge these tables and take out the duplicates. The problem comes in where the the only fields are name address and email address(sometimes). So i have to do a dup check on the name and address. The way im doing it is taking 5 min for every 1000 records. I tried writing a stored procedure that checked each one with a if exists statement then insert. I know there has to be a better way than that. ANy help or am i stuck waiting a couple weeks to be done. By the way the cpu i'm using is also using 85% cpu to do the checking.

      M Offline
      M Offline
      Michael Potter
      wrote on last edited by
      #2

      The answer to your question would depend on where your duplicates can exist. Are there duplicates inside of the individual databases or only between the two databases? If it is the latter you can use a strategy like this: 1) Copy over 1 table to a new table. 2) The new table should have indexes on Name & Address (to avoid mulitple table scans). For speed you can create the indexes after the copy. 3) Create an insert statement qualified by the non-duplicates.

      INSERT INTO [NewTable]
          (Name, Address, Email, etc...)
      SELECT
          s.Name, 
          s.Address, 
          s.Email, 
          s.etc...
      FROM 
          [2ndTable] s
      LEFT JOIN 
          [NewTable] n
          ON ((s.Name = n.Name)&&(s.Address = n.Address))
      WHERE
          n.Name IS NULL
      
      1 Reply Last reply
      0
      Reply
      • Reply as topic
      Log in to reply
      • Oldest to Newest
      • Newest to Oldest
      • Most Votes


      • Login

      • Don't have an account? Register

      • Login or register to search.
      • First post
        Last post
      0
      • Categories
      • Recent
      • Tags
      • Popular
      • World
      • Users
      • Groups