How to remove duplicate files rapidly?
-
The current thinking is to calculate a hash for every file and then delete the duplicates. We can use MD5 or SHA-1 to calculate the hash. From a speed point of view, MD4 is a better choice, since it is faster (though less secure). But after experimenting, I found that even MD4 is slow enough to cost a lot of time when dealing with large files. Is there any algorithm that works faster?
You can speed up the process considerably: if you have mostly big files, you can use a two-step hash comparison. In the first step you take only, say, 1-5 KB from the beginning of each file to build your hash (I'd recommend SHA-2). You only perform a hash calculation over the complete file if the partial hashes of two files match. Best, Manfred
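A minimal sketch of that two-step idea in Python (hashlib.sha256 stands in for SHA-2 here; the 4 KB prefix and 1 MB chunk size are assumptions, tune them to your data):

import hashlib

def partial_hash(path, prefix_bytes=4096):
    # Hash only the first few KB -- a cheap filter for obvious non-duplicates.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(prefix_bytes))
    return h.hexdigest()

def full_hash(path, chunk_size=1 << 20):
    # Hash the whole file in chunks; only worth doing when partial hashes collide.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

Two files only need full_hash() if their partial_hash() values (and, ideally, their sizes) already match; everything else is ruled out after reading a few kilobytes.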
"I had the right to remain silent, but I didn't have the ability!"
Ron White, Comedian
-
It is almost certain that the problem is not your hash algorithm but I/O. All (or most) hash algorithms are short and fast enough not to be the bottleneck... If you have a large number of files, you have to rethink your approach:
1. Use the file system's FileInfo - name, size, creation time, last modified time and so on.
2. If you can't, you may consider hashing only the first block (4 KB) of every file and going further only for those found to be the same.
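A rough sketch of point 1, assuming Python (os.walk and os.stat play the role of FileInfo here): group by size first, and only hash the groups that actually have more than one member.

import os
from collections import defaultdict

def group_by_size(root):
    # Files of different sizes cannot be duplicates, so bucket by size first.
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                groups[os.stat(path).st_size].append(path)
            except OSError:
                pass  # unreadable file, skip it
    # Only size groups with more than one file need any hashing at all.
    return {size: paths for size, paths in groups.items() if len(paths) > 1}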
I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is (V).
-
Thanks for your advice. However, I wonder why SHA-2 should be a better choice when it is slower than MD5 or MD4?
It was developed to be more robust, i.e. the probability of collisions is reduced. Just look it up; there are plenty of good explanations on the internet of what the actual improvements over its predecessors are. Cheers!
"I had the right to remain silent, but I didn't have the ability!"
Ron White, Comedian
-
Thank you, V. ^_^ Your advice is good. In the beginning I did not have a clear understanding of this question. Now I agree that I/O should be the key point, and I hope I can find a way to solve my problem. Thanks!
Yes, I/O is a key issue... If it is in your power, you may move the storage to something more capable, like SCSI or flash; that should speed up I/O...
I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is (V).
-
I wrote a little app to do this for my own use, and I used multiple tests. For every file to compare, find:
1. The size - if they aren't the same size, they aren't the same file.
2. A copy of the first 100 bytes - if the first 100 bytes don't match, they aren't the same file. This is easy and fast to compare, requires no computation, and doesn't take much to store.
That stage is very fast; it generally takes far longer just to get the list of files than it does to run all of those tests. After that, calculate the hash (I used SHA-1) of the first few KB of each remaining file and compare the hashes. After those two stages you will have eliminated most of the non-duplicates. Then you can do a full hash of the remaining files to find the definite duplicates.
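A rough sketch of that staged elimination (this is not the poster's actual app; SHA-1, the 100-byte prefix and the 4 KB partial hash simply follow the description above):

import hashlib
import os
from collections import defaultdict

def _bucket(paths, key):
    # Group paths by key; only groups of two or more can contain duplicates.
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]

def _head(path, n):
    with open(path, "rb") as f:
        return f.read(n)

def _sha1(path, first_bytes=None, chunk=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        if first_bytes is not None:
            h.update(f.read(first_bytes))      # partial hash of the first few KB
        else:
            for c in iter(lambda: f.read(chunk), b""):
                h.update(c)                    # full hash, streamed in chunks
    return h.digest()

def find_duplicates(paths):
    # Stage 1: size. Stage 2: first 100 bytes. Stage 3: hash of first 4 KB.
    # Stage 4: full hash. Each stage only sees survivors of the previous one.
    groups = _bucket(paths, os.path.getsize)
    for key in (lambda p: _head(p, 100),
                lambda p: _sha1(p, 4096),
                lambda p: _sha1(p)):
        groups = [g for grp in groups for g in _bucket(grp, key)]
    return groups  # each inner list holds files that hash identically throughout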
-
You could give that code a polish and publish it here as a tip or article...
I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is (V).
-
I agree one can use multiple algorithms, chosen based on suitability:
1. Check the size; if it does not match, the files are not the same.
2. Compare 2% of the data from the beginning and then from the end; if those do not match, the files are not the same.
3. Finally, one option is to simply compare the remaining parts of both files, or use a CRC, checksum, or MD/SHA hash. If the MD/SHA values are not already calculated and stored, those will be a bit costly.
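A hedged sketch of steps 1 and 2 (the 2% figure is the one from the post; comparing both ends costs only two short reads per file):

import os

def quick_match(path_a, path_b, fraction=0.02):
    # Step 1: different sizes means different files.
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    if size == 0:
        return True  # two empty files are trivially identical
    n = max(1, int(size * fraction))
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        # Step 2: compare a slice from the beginning...
        if fa.read(n) != fb.read(n):
            return False
        # ...and a slice from the end.
        fa.seek(-n, os.SEEK_END)
        fb.seek(-n, os.SEEK_END)
        return fa.read(n) == fb.read(n)

A True result here only means the files are still candidates; step 3 (a full comparison or a stored checksum) settles it.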
Manoj Never Gives up
-
I found a similar solution when I had to recover 4.5 TB of files, mostly between 0.5 and 15 GB, where around 1-2% of them had been terminated early during a copy process but were left with the correct file size, padded with zeroes. It adds to your suggestion: scan all big files with big fixed steps, plus some random jumps, all the way up to their ends.
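For what it's worth, a sketch of that sampling idea (sha256, the 64 KB sample, the 64 MB stride and the eight extra jumps are all assumptions; the random offsets are seeded with the file size so two files of equal size are sampled at the same places, and the tail is always included so zero-padded endings are caught):

import hashlib
import os
import random

def sampled_fingerprint(path, sample=64 * 1024, stride=64 * 1024 * 1024, jumps=8):
    # Hash small samples at fixed intervals plus a few pseudo-random offsets,
    # always including the end of the file, instead of reading everything.
    size = os.path.getsize(path)
    offsets = set(range(0, size, stride))       # big fixed steps
    rng = random.Random(size)                   # same offsets for equal-size files
    if size:
        offsets.update(rng.randrange(size) for _ in range(jumps))  # random jumps
    offsets.add(max(0, size - sample))          # always sample the tail
    h = hashlib.sha256(str(size).encode())
    with open(path, "rb") as f:
        for off in sorted(offsets):
            f.seek(off)
            h.update(f.read(sample))
    return h.hexdigest()

Matching fingerprints still only mark candidates, but a file whose tail was zero-filled during a broken copy will differ from its good twin at the sampled offsets near the end, which is exactly the case described above.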