Best approach to compare files
-
I have several directories, each containing some files, and I want to write a program that compares every file in each directory with every file in all the directories. Can someone suggest the best approach, one that runs fast and doesn't miss any comparison?
-
Hi friend, use the FindFirstFile() and FindNextFile() APIs (documented on MSDN)... hope this will help you.
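If it helps, a minimal sketch of walking a single directory with those two APIs might look like this (the folder path is just a placeholder):

#include <windows.h>
#include <stdio.h>

int main()
{
    WIN32_FIND_DATAA findData;
    // Hypothetical directory; "*" matches every entry in it.
    HANDLE hFind = FindFirstFileA("C:\\MyFolder\\*", &findData);
    if (hFind == INVALID_HANDLE_VALUE)
        return 1;
    do
    {
        // Skip subdirectories; we only want the files to compare.
        if (!(findData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
            printf("%s\n", findData.cFileName);
    } while (FindNextFileA(hFind, &findData));
    FindClose(hFind);
    return 0;
}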
-
Hi, thanks for your reply. FindFirstFile/FindNextFile will only give me the file names. What I asked about is the approach, i.e. the container and the logic for implementing the comparison, which should be fast and reliable.
It really depends on what kind of value you wish to return from the function. For instance, I can envisage two basic approaches: true/false for same/not same, or a -/0/+ value as strcmp does. Here's an approach that just returns true/false. I've timed it; it takes 6 seconds to compare a 347 MB video file with itself (Win7 Home Premium x64, Intel i3 @ 2.13 GHz, 4 GB RAM, gcc 4.4.1).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Read an entire file into a malloc'd buffer and report its size.
void loadFile(char *szFilename, size_t &sizeOut, char* &bufferOut)
{
    FILE *fp = fopen(szFilename, "rb");
    fseek(fp, 0, SEEK_END);
    sizeOut = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    bufferOut = (char*) malloc(sizeOut);
    fread(bufferOut, 1, sizeOut, fp);
    fclose(fp);
}

// Byte-for-byte comparison; files of different length can never match.
bool isSame(char *szFilename1, char *szFilename2)
{
    size_t len1, len2, curPos;
    char *buffer1, *buffer2;
    bool result = true;
    loadFile(szFilename1, len1, buffer1);
    loadFile(szFilename2, len2, buffer2);
    if (len1 == len2)
    {
        for (curPos = 0; curPos < len1; curPos++)
            if (buffer1[curPos] != buffer2[curPos]) { result = false; break; }
    }
    else
        result = false;
    free(buffer1);
    free(buffer2);
    return result;
}

int main()
{
    if (isSame("testVideo.avi", "testVideo.avi"))
        printf("Same\n");
    else
        printf("Not Same\n");
}
-
For SuperDuper[^] I did this:
0. build the list of files to compare.
1. read the first 1000 bytes of each file to compare.
2. from those 1000 bytes, generate an SHA1 hash.
3. store the hashes, file size, and file names in a list.
4. sort the list by hash values. duplicates will clump together.
5. for hash matches, do a full file compare.
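Not the actual SuperDuper code, but a rough C++ sketch of that pre-hash pipeline might look like the following; it uses std::hash on the first 1000 bytes as a cheap stand-in for the SHA1/MD5 step, error handling is omitted, and the file names in main() are made up for illustration:

#include <cstdio>
#include <cstdint>
#include <cstddef>
#include <string>
#include <vector>
#include <fstream>
#include <algorithm>

struct Entry { std::string path; std::uint64_t size; std::size_t hash; };

// Steps 1-3: read the first 1000 bytes, hash them, remember size and name.
Entry makeEntry(const std::string &path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    char prefix[1000];
    in.read(prefix, sizeof(prefix));
    std::size_t got = static_cast<std::size_t>(in.gcount());
    in.clear();                      // a file shorter than 1000 bytes sets eof/fail bits
    in.seekg(0, std::ios::end);
    Entry e;
    e.path = path;
    e.size = static_cast<std::uint64_t>(in.tellg());
    e.hash = std::hash<std::string>()(std::string(prefix, got));
    return e;
}

// Steps 4-5: sort by (hash, size) so potential duplicates clump together;
// only those clumps need the expensive full compare.
void reportCandidates(const std::vector<std::string> &files)
{
    std::vector<Entry> list;
    for (std::size_t i = 0; i < files.size(); i++)
        list.push_back(makeEntry(files[i]));
    std::sort(list.begin(), list.end(),
              [](const Entry &a, const Entry &b)
              { return a.hash != b.hash ? a.hash < b.hash : a.size < b.size; });
    for (std::size_t i = 1; i < list.size(); i++)
        if (list[i].hash == list[i - 1].hash && list[i].size == list[i - 1].size)
            std::printf("full-compare candidates: %s <-> %s\n",
                        list[i - 1].path.c_str(), list[i].path.c_str());
}

int main()
{
    std::vector<std::string> files;          // hypothetical file list (step 0)
    files.push_back("dir1\\a.dat");
    files.push_back("dir2\\b.dat");
    reportCandidates(files);
}

The full compare on the candidate pairs could then be something like the byte-for-byte isSame() from the earlier reply.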
-
Chris Losinger wrote:
2. from those 1000 bytes, generate an SHA1 hash.
3. store the hashes, file size, and file names in a list
Why would you want to waste all those CPU cycles calculating a SHA1 hash? If you know that the maximum number of bytes you're going to read is 1000 then you could have used an inexpensive hashing algorithm... there are dozens of inexpensive hash algorithms that will have a collision probability of zero within 1000 bytes. Your application would probably get a huge performance boost. Not to mention that a SHA1 hash produces 20 bytes of data to store; you could reduce your database size by 50% just by moving to a more efficient algorithm. Were you initially planning on hashing the entire file or something?
Best Wishes,
-David Delaune
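For illustration only (neither poster names a specific algorithm), one example of the kind of inexpensive hash being suggested is 32-bit FNV-1a, which is a few lines of code and stores only 4 bytes per file:

#include <cstdint>
#include <cstddef>

// 32-bit FNV-1a: one XOR and one multiply per byte.
std::uint32_t fnv1a(const unsigned char *data, std::size_t len)
{
    std::uint32_t hash = 2166136261u;        // FNV offset basis
    for (std::size_t i = 0; i < len; i++)
    {
        hash ^= data[i];
        hash *= 16777619u;                   // FNV prime
    }
    return hash;
}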
-
It's actually an MD5, but it can be switched to SHA1 or whatever at compile time. (I wrote the earlier post from memory and forgot the default.)
Randor wrote:
If you know that the maximum number of bytes you're going to read is 1000 then you could have used an inexpensive hashing algorithm.
Probably. But the goal was to avoid collisions while using a hash method that was built into the .NET Framework; this was a quick utility project. And really, calculating an MD5 over 1K is definitely not the bottleneck here; the file scanning and the open/read/close totally dominate the process, especially when those files are on a network drive somewhere.
Randor wrote:
Were you initially planning on hashing the entire file or something?
Actually, yes (and that's what happens when the 1K scan finds matches). The partial-hash pre-scan was a later addition, and I just reused the hash code that was already written instead of coming up with a new method.