How to remove duplicate files rapidly?
-
The current thinking is to calculate a hash for every file and then delete the duplicates. We can use MD5 or SHA-1 to calculate the hash. From a speed point of view, MD4 is a better choice, since it is faster (though less secure). But after experimenting, I found that even MD4 is slow enough to cost a lot of time when dealing with large files. Is there any algorithm that works faster?
You can speed up the process considerably: if you have mostly big files, you can use a two-step hash comparison. In the first step you take only, say, 1-5 KB from the beginning of each file to build your hash (I'd recommend SHA-2). You only perform a hash calculation over the complete file if the partial hashes of two files match. Best, Manfred
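A minimal sketch of that two-step idea in Python (hashlib.sha256 stands in for SHA-2 here; the 4 KB prefix and 1 MB chunk size are assumptions, tune them to your data):

import hashlib

def partial_hash(path, prefix_bytes=4096):
    # Hash only the first few KB -- a cheap filter for obvious non-duplicates.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(prefix_bytes))
    return h.hexdigest()

def full_hash(path, chunk_size=1 << 20):
    # Hash the whole file in chunks; only worth doing when partial hashes collide.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

Two files only need full_hash() if their partial_hash() values (and, ideally, their sizes) already match; everything else is ruled out after reading a few kilobytes.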
"I had the right to remain silent, but I didn't have the ability!"
Ron White, Comedian
-
It is almost certain that the problem is not your hash algorithm but I/O. All (or most) hash algorithms are short and fast enough not to be the bottleneck... If you have a large number of files, you have to rethink your approach:
1. Use the file system's FileInfo - name, size, creation time, last modified time and so on.
2. If you can't, you may consider hashing only the first block (4 KB) of every file and going further only for those found to be the same.
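A rough sketch of point 1, assuming Python (os.walk and os.stat play the role of FileInfo here): group by size first, and only hash the groups that actually have more than one member.

import os
from collections import defaultdict

def group_by_size(root):
    # Files of different sizes cannot be duplicates, so bucket by size first.
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                groups[os.stat(path).st_size].append(path)
            except OSError:
                pass  # unreadable file, skip it
    # Only size groups with more than one file need any hashing at all.
    return {size: paths for size, paths in groups.items() if len(paths) > 1}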
I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is (V).
-
Thanks for your advice. However, I wonder why SHA-2 should be a better choice when it is slower than MD5 or MD4?
It was developed to be more robust, i.e. the probability of collisions is reduced. Just look it up; there are plenty of good explanations on the internet of what the actual improvements over its predecessors are. Cheers!
"I had the right to remain silent, but I didn't have the ability!"
Ron White, Comedian
-
Thank you, V. ^_^ Your advice is good. In the beginning I did not have a clear understanding of this question. Now I agree that I/O should be the key point, and I hope I can find a way to solve my problem. Thanks!
Yes, I/O is a key issue... If it is in your power, you may move the storage to something more capable, like SCSI or flash; that should speed up I/O...
I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is (V).
-
I wrote a little app to do this for my own use, and I used multiple tests. For every file to compare, find:
1. The size - if they aren't the same size, they aren't the same file.
2. A copy of the first 100 bytes - if the first 100 bytes don't match, they aren't the same file. This is easy and fast to compare, requires no computation, and doesn't take much to store.
That stage is very fast; it generally takes far longer just to get the list of files than it does to run all of those tests. After that, calculate the hash (I used SHA-1) of the first few KB of each remaining file and compare the hashes. After those two stages you will have eliminated most of the non-duplicates. Then you can do a full hash of the remaining files to find the definite duplicates.
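A rough sketch of that staged elimination (this is not the poster's actual app; SHA-1, the 100-byte prefix and the 4 KB partial hash simply follow the description above):

import hashlib
import os
from collections import defaultdict

def _bucket(paths, key):
    # Group paths by key; only groups of two or more can contain duplicates.
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]

def _head(path, n):
    with open(path, "rb") as f:
        return f.read(n)

def _sha1(path, first_bytes=None, chunk=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        if first_bytes is not None:
            h.update(f.read(first_bytes))      # partial hash of the first few KB
        else:
            for c in iter(lambda: f.read(chunk), b""):
                h.update(c)                    # full hash, streamed in chunks
    return h.digest()

def find_duplicates(paths):
    # Stage 1: size. Stage 2: first 100 bytes. Stage 3: hash of first 4 KB.
    # Stage 4: full hash. Each stage only sees survivors of the previous one.
    groups = _bucket(paths, os.path.getsize)
    for key in (lambda p: _head(p, 100),
                lambda p: _sha1(p, 4096),
                lambda p: _sha1(p)):
        groups = [g for grp in groups for g in _bucket(grp, key)]
    return groups  # each inner list holds files that hash identically throughout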
-
You could give that code a polish and publish it here as a tip or article...
I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is (V).
-
I agree one can use multiple algorithms, chosen based on suitability:
1. Check the size; if it does not match, the files are not the same.
2. Compare 2% of the data from the beginning and then from the end; if those do not match, the files are not the same.
3. Finally, one option is to simply compare the remaining parts of both files, or use a CRC, checksum, or MD/SHA hash. If the MD/SHA values are not already calculated and stored, those will be a bit costly.
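A hedged sketch of steps 1 and 2 (the 2% figure is the one from the post; comparing both ends costs only two short reads per file):

import os

def quick_match(path_a, path_b, fraction=0.02):
    # Step 1: different sizes means different files.
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    if size == 0:
        return True  # two empty files are trivially identical
    n = max(1, int(size * fraction))
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        # Step 2: compare a slice from the beginning...
        if fa.read(n) != fb.read(n):
            return False
        # ...and a slice from the end.
        fa.seek(-n, os.SEEK_END)
        fb.seek(-n, os.SEEK_END)
        return fa.read(n) == fb.read(n)

A True result here only means the files are still candidates; step 3 (a full comparison or a stored checksum) settles it.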
Manoj Never Gives up
-
I found a similar solution when I had to recover 4.5 TB of files, mostly between 0.5 and 15 GB, where around 1-2% of them had been terminated early during a copy process but were left with the correct file size, padded with zeroes. It adds to your suggestion: scan all big files with big fixed steps, plus some random jumps, all the way up to their ends.
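For what it's worth, a sketch of that sampling idea (sha256, the 64 KB sample, the 64 MB stride and the eight extra jumps are all assumptions; the random offsets are seeded with the file size so two files of equal size are sampled at the same places, and the tail is always included so zero-padded endings are caught):

import hashlib
import os
import random

def sampled_fingerprint(path, sample=64 * 1024, stride=64 * 1024 * 1024, jumps=8):
    # Hash small samples at fixed intervals plus a few pseudo-random offsets,
    # always including the end of the file, instead of reading everything.
    size = os.path.getsize(path)
    offsets = set(range(0, size, stride))       # big fixed steps
    rng = random.Random(size)                   # same offsets for equal-size files
    if size:
        offsets.update(rng.randrange(size) for _ in range(jumps))  # random jumps
    offsets.add(max(0, size - sample))          # always sample the tail
    h = hashlib.sha256(str(size).encode())
    with open(path, "rb") as f:
        for off in sorted(offsets):
            f.seek(off)
            h.update(f.read(sample))
    return h.hexdigest()

Matching fingerprints still only mark candidates, but a file whose tail was zero-filled during a broken copy will differ from its good twin at the sampled offsets near the end, which is exactly the case described above.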