compare files for duplicates
-
Hi, can someone point me in the right direction on how to search through all files over multiple drives to look for duplicates? Thanks in advance!
Hi, 1. There must be lots of utilities around that do this, some even free. 2. If I were to develop one, I would enumerate all files and calculate a checksum for each, then investigate only the files with matching checksums. Warning: different partitions/devices can use different file systems, resulting in slightly different timestamps (e.g. FAT timestamps are accurate only to 2 seconds) and maybe even slightly different sizes. [ADDED]Daylight Saving Time conventions may also differ between machines on a network.[/ADDED] Anyway, this would take a while, as all data has to be read from the disk(s) to do it properly. One cannot simply rely on file names! :)
Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles]
I only read code that is properly formatted, adding PRE tags is the easiest way to obtain that.
All Toronto weekends should be extremely wet until we get it automated in regular forums, not just QA.
modified on Sunday, February 14, 2010 6:08 PM
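A minimal sketch of the approach described above (enumerate all files, checksum each, then investigate only matching checksums), in Python. The function names and the choice of SHA-256 are my own illustration, not something from the original post:

```python
import hashlib
import os
from collections import defaultdict

def file_checksum(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(roots):
    """Group files under the given root directories by checksum;
    return only groups with more than one file."""
    by_hash = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_hash[file_checksum(path)].append(path)
                except OSError:
                    continue  # unreadable file; skip it
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

As the post says, this reads every byte from disk, so it is slow on large drives; the later replies in this thread suggest filtering by file size first.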
-
Hi, I know there are a few tools around that can do it; I'm just in it for the experience. What do you mean by calculating a checksum?
Something like a longitudinal checksum: summarizing all the data into a short number, so you can store it in memory and compare. Equal files are bound to have equal checksums, and equal checksums probably (but not certainly) indicate identical file contents. Google or another search engine should be your friend; you might read this[^] and more. :)
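For illustration, a longitudinal checksum in its simplest form is just an XOR over all the bytes. This toy Python version (my own example, not from the post) also shows why equal checksums do not guarantee equal contents:

```python
from functools import reduce

def lrc(data: bytes) -> int:
    """Longitudinal redundancy check: XOR of every byte.
    A very weak checksum, used here only to illustrate the idea."""
    return reduce(lambda acc, b: acc ^ b, data, 0)

# XOR is order-independent, so reordered bytes collide:
# lrc(b"ab") and lrc(b"ba") are equal even though the data differs.
```

This is why real tools use stronger hashes (CRC32, MD5, SHA-family): collisions become far less likely, though a final byte-by-byte comparison is still the only certain check.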
-
Hi, Can someone point me in the right direction on how to search through all files over multiple drives to look for duplicates? Thanks in advance!
Comparing file contents is slow, so you only want to do it when you have to. You can go through the drives and build a hash table of files keyed by size: files with unique sizes cannot be duplicates. For the files that have the same size, the next step is to look at contents (I'm assuming you also want to detect duplicate contents stored under different names, or with different timestamps). You could store checksums with the files, as Luc suggested, but checksums can collide: two different files may produce the same checksum. They are still useful for avoiding content comparisons: if the checksums differ, the files differ; but if they match, you have to compare the contents byte by byte for confirmation.
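The size-first pipeline described above could be sketched like this in Python (the names and the choice of SHA-256 are illustrative assumptions on my part):

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def find_duplicates(roots):
    """Find duplicate files: group by size, then by checksum,
    then confirm byte by byte."""
    # 1. Group files by size; a unique size means no duplicate.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    continue  # unreadable; skip
    duplicates = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        # 2. Within each same-size group, group by checksum.
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.digest()].append(path)
        # 3. Confirm equal-checksum files with a byte-by-byte compare.
        for group in by_hash.values():
            if len(group) < 2:
                continue
            first = group[0]
            confirmed = [p for p in group[1:]
                         if filecmp.cmp(first, p, shallow=False)]
            if confirmed:
                duplicates.append([first] + confirmed)
    return duplicates
```

Note how each stage only does expensive work on the survivors of the previous one: sizes come free from directory metadata, checksums require one full read, and the byte-by-byte comparison runs only on the rare equal-checksum groups.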