I had a similar issue where I needed to import tens of thousands of records every day, most of them duplicates. We only inserted a record if it didn't already exist, and that check was a serious performance bottleneck. I generated hashes for all of the existing rows and added an index on the new hash column. Now I calculate the hash of the values in each new row to be inserted and compare it against the existing rows, skipping the insert if an identical row is already there. Using the index solved the insert performance problem for me. I don't recommend MD5 or similar weak hashes, because collisions are too easy to produce.
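Roughly, the approach looks like this. This is only a minimal sketch in Python with SQLite as a stand-in for the real database; the table, column names, and value separator are made up, and SHA-256 stands in for whatever non-MD5 hash you choose. A production setup would more likely keep the hash as an indexed column maintained by the database itself.

```python
import hashlib
import sqlite3

def row_hash(values):
    """SHA-256 over the row's values, joined with a separator unlikely to appear in the data."""
    joined = "\x1f".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

conn = sqlite3.connect("import.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        col_a TEXT,
        col_b TEXT,
        col_c TEXT,
        row_hash TEXT
    )
""")
# The index on the hash column is what makes the existence check cheap.
conn.execute("CREATE INDEX IF NOT EXISTS ix_records_hash ON records (row_hash)")

def insert_if_new(row):
    h = row_hash(row)
    exists = conn.execute(
        "SELECT 1 FROM records WHERE row_hash = ? LIMIT 1", (h,)
    ).fetchone()
    if exists:
        return False  # an identical row is already present; skip the insert
    conn.execute(
        "INSERT INTO records (col_a, col_b, col_c, row_hash) VALUES (?, ?, ?, ?)",
        (*row, h),
    )
    return True

for row in [("a", "b", "c"), ("a", "b", "c"), ("x", "y", "z")]:
    insert_if_new(row)
conn.commit()
```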
Storing huge numbers of files

File size is critically important. If a file runs just past a block boundary, the rest of that block is dead space. Assuming a 4 KB block size and files that each hold 1 KB of data, that's 3 KB of wasted space on disk per file. If you zip up the files, they'll store much, much more efficiently. We have this problem with hundreds of thousands of small text files. We sweep them up and zip them into archive folders on occasion to clean up the folders and reclaim disk space.
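For illustration, that sweep-and-zip step can be as simple as the sketch below. The folder and archive paths, the `*.txt` pattern, and the 4 KB threshold are all made-up examples, not the exact process we use.

```python
import zipfile
from pathlib import Path

def sweep_to_archive(folder, archive_path, max_size=4096):
    """Zip up small text files, then delete the originals to reclaim
    the per-file block-size overhead."""
    small_files = [f for f in Path(folder).glob("*.txt")
                   if f.stat().st_size <= max_size]
    Path(archive_path).parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in small_files:
            zf.write(f, arcname=f.name)
    # Delete the originals only after the archive has been written and closed.
    for f in small_files:
        f.unlink()

# Example usage (paths and threshold are hypothetical):
sweep_to_archive("daily_files", "archives/daily.zip")
```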