Best approach to compare files
-
I have several directories, each containing some files, and I want to write a program that compares every file in each directory with every file in all the directories. Can someone suggest the best approach, one that runs fast and doesn't miss any comparison?
-
Hi friend, use the FindFirstFile() and FindNextFile() APIs (documented on MSDN)... hope this will help you.
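If it helps, a minimal sketch of walking a single directory with those two APIs might look like this (the folder path is just a placeholder):

#include <windows.h>
#include <stdio.h>

int main()
{
    WIN32_FIND_DATAA findData;
    // Hypothetical directory; "*" matches every entry in it.
    HANDLE hFind = FindFirstFileA("C:\\MyFolder\\*", &findData);
    if (hFind == INVALID_HANDLE_VALUE)
        return 1;
    do
    {
        // Skip subdirectories; we only want the files to compare.
        if (!(findData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
            printf("%s\n", findData.cFileName);
    } while (FindNextFileA(hFind, &findData));
    FindClose(hFind);
    return 0;
}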
-
Hi, thanks for your reply. FindFirstFile/FindNextFile will only give me the file names. What I asked about is the approach, i.e. the container and the logic for implementing the comparison, which should be fast and reliable.
It really depends on what kind of value you wish to return from the function. For instance, I can envisage two basic approaches: true/false for same/not same, or a -/0/+ value as strcmp does. Here's an approach that just returns true/false. I've timed it; it takes 6 seconds to compare a 347 MB video file with itself (Win7 Home Premium x64, Intel i3 @ 2.13 GHz, 4 GB RAM, gcc 4.4.1).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Read an entire file into a malloc'd buffer and report its size.
void loadFile(char *szFilename, size_t &sizeOut, char* &bufferOut)
{
    FILE *fp = fopen(szFilename, "rb");
    fseek(fp, 0, SEEK_END);
    sizeOut = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    bufferOut = (char*) malloc(sizeOut);
    fread(bufferOut, 1, sizeOut, fp);
    fclose(fp);
}

// Byte-for-byte comparison; files of different length can never match.
bool isSame(char *szFilename1, char *szFilename2)
{
    size_t len1, len2, curPos;
    char *buffer1, *buffer2;
    bool result = true;
    loadFile(szFilename1, len1, buffer1);
    loadFile(szFilename2, len2, buffer2);
    if (len1 == len2)
    {
        for (curPos = 0; curPos < len1; curPos++)
            if (buffer1[curPos] != buffer2[curPos]) { result = false; break; }
    }
    else
        result = false;
    free(buffer1);
    free(buffer2);
    return result;
}

int main()
{
    if (isSame("testVideo.avi", "testVideo.avi"))
        printf("Same\n");
    else
        printf("Not Same\n");
}
-
For SuperDuper[^] I did this:
0. build the list of files to compare.
1. read the first 1000 bytes of each file to compare.
2. from those 1000 bytes, generate an SHA1 hash.
3. store the hashes, file size, and file names in a list.
4. sort the list by hash values. duplicates will clump together.
5. for hash matches, do a full file compare.
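Not the actual SuperDuper code, but a rough C++ sketch of that pre-hash pipeline might look like the following; it uses std::hash on the first 1000 bytes as a cheap stand-in for the SHA1/MD5 step, error handling is omitted, and the file names in main() are made up for illustration:

#include <cstdio>
#include <cstdint>
#include <cstddef>
#include <string>
#include <vector>
#include <fstream>
#include <algorithm>

struct Entry { std::string path; std::uint64_t size; std::size_t hash; };

// Steps 1-3: read the first 1000 bytes, hash them, remember size and name.
Entry makeEntry(const std::string &path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    char prefix[1000];
    in.read(prefix, sizeof(prefix));
    std::size_t got = static_cast<std::size_t>(in.gcount());
    in.clear();                      // a file shorter than 1000 bytes sets eof/fail bits
    in.seekg(0, std::ios::end);
    Entry e;
    e.path = path;
    e.size = static_cast<std::uint64_t>(in.tellg());
    e.hash = std::hash<std::string>()(std::string(prefix, got));
    return e;
}

// Steps 4-5: sort by (hash, size) so potential duplicates clump together;
// only those clumps need the expensive full compare.
void reportCandidates(const std::vector<std::string> &files)
{
    std::vector<Entry> list;
    for (std::size_t i = 0; i < files.size(); i++)
        list.push_back(makeEntry(files[i]));
    std::sort(list.begin(), list.end(),
              [](const Entry &a, const Entry &b)
              { return a.hash != b.hash ? a.hash < b.hash : a.size < b.size; });
    for (std::size_t i = 1; i < list.size(); i++)
        if (list[i].hash == list[i - 1].hash && list[i].size == list[i - 1].size)
            std::printf("full-compare candidates: %s <-> %s\n",
                        list[i - 1].path.c_str(), list[i].path.c_str());
}

int main()
{
    std::vector<std::string> files;          // hypothetical file list (step 0)
    files.push_back("dir1\\a.dat");
    files.push_back("dir2\\b.dat");
    reportCandidates(files);
}

The full compare on the candidate pairs could then be something like the byte-for-byte isSame() from the earlier reply.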
-
Chris Losinger wrote:
2. from those 1000 bytes, generate an SHA1 hash.
3. store the hashes, file size, and file names in a list
Why would you want to waste all those CPU cycles calculating a SHA1 hash? If you know that the maximum number of bytes you're going to read is 1000 then you could have used an inexpensive hashing algorithm... there are dozens of inexpensive hash algorithms that will have a collision probability of zero within 1000 bytes. Your application would probably get a huge performance boost. Not to mention that a SHA1 hash produces 20 bytes of data to store; you could reduce your database size by 50% just by moving to a more efficient algorithm. Were you initially planning on hashing the entire file or something?
Best Wishes,
-David Delaune
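For illustration only (neither poster names a specific algorithm), one example of the kind of inexpensive hash being suggested is 32-bit FNV-1a, which is a few lines of code and stores only 4 bytes per file:

#include <cstdint>
#include <cstddef>

// 32-bit FNV-1a: one XOR and one multiply per byte.
std::uint32_t fnv1a(const unsigned char *data, std::size_t len)
{
    std::uint32_t hash = 2166136261u;        // FNV offset basis
    for (std::size_t i = 0; i < len; i++)
    {
        hash ^= data[i];
        hash *= 16777619u;                   // FNV prime
    }
    return hash;
}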
-
It's actually an MD5, but it can be switched to SHA1 or whatever at compile time. (I wrote the earlier post from memory and forgot the default.)
Randor wrote:
If you know that the maximum number of bytes you're going to read is 1000 then you could have used an inexpensive hashing algorithm.
Probably. But the goal was to avoid collisions while using a hash method that was built into the .NET Framework; this was a quick utility project. And really, calculating an MD5 over 1K is definitely not the bottleneck here; the file scanning and the open/read/close totally dominate the process, especially when those files are on a network drive somewhere.
Randor wrote:
Were you initially planning on hashing the entire file or something?
Actually, yes (and that's what happens when the 1K scan finds matches). The partial-hash pre-scan was a later addition, and I just reused the hash code that was already written instead of coming up with a new method.