Code Project
Best approach to compare files

C / C++ / MFC · 7 Posts · 5 Posters
#1 - MKC002

I have a set of directories, each containing some files, and I want to write a program that compares every file in every directory against every file in all the directories. Can someone suggest the best approach, one that is fast and doesn't miss any comparison?

#2 - sangamdumne

Hi friend, use the FindFirstFile() and FindNextFile() Win32 APIs (documented on MSDN) to enumerate the files. Hope this helps.

#3 - MKC002

Hi, thanks for your reply. FindFirstFile/FindNextFile will only give me the file names. I was asking about the approach, i.e. the container and the logic to implement the comparison, which should be fast and reliable.

#4 - enhzflep

It really depends on what kind of value you wish to return from the function. I can envisage two basic approaches: return true/false for same/not-same, or return a negative/zero/positive value as strcmp does. Here's an approach that just returns true/false. I've timed it: it takes 6 seconds to compare a 347 MB video file with itself (Win7 Home Premium x64, Intel i3 @ 2.13 GHz, 4 GB RAM, gcc 4.4.1).

#include <stdio.h>
#include <stdlib.h>

// Reads the whole file into a freshly malloc'd buffer.
// On failure, sizeOut is 0 and bufferOut is NULL.
void loadFile(const char *szFilename, size_t &sizeOut, char* &bufferOut)
{
    sizeOut = 0;
    bufferOut = NULL;

    FILE *fp = fopen(szFilename, "rb");
    if (!fp)
        return;

    fseek(fp, 0, SEEK_END);
    sizeOut = (size_t)ftell(fp);
    fseek(fp, 0, SEEK_SET);

    bufferOut = (char*)malloc(sizeOut);
    if (bufferOut)
        fread(bufferOut, 1, sizeOut, fp);
    fclose(fp);
}

bool isSame(const char *szFilename1, const char *szFilename2)
{
    size_t len1, len2;
    char *buffer1, *buffer2;
    bool result = true;

    loadFile(szFilename1, len1, buffer1);
    loadFile(szFilename2, len2, buffer2);

    if (!buffer1 || !buffer2 || len1 != len2)
        result = false;
    else
    {
        for (size_t curPos = 0; curPos < len1; curPos++)
        {
            if (buffer1[curPos] != buffer2[curPos])
            {
                result = false;
                break;          // no point scanning further
            }
        }
    }

    free(buffer1);
    free(buffer2);
    return result;
}

int main()
{
    if (isSame("testVideo.avi", "testVideo.avi"))
        printf("Same\n");
    else
        printf("Not Same\n");
}

#5 - Chris Losinger

For SuperDuper[^] I did this:

0. Build the list of files to compare.
1. Read the first 1000 bytes of each file to compare.
2. From those 1000 bytes, generate an SHA1 hash.
3. Store the hashes, file sizes, and file names in a list.
4. Sort the list by hash value; duplicates will clump together.
5. For hash matches, do a full file compare.

            image processing toolkits | batch image processing

#6 - Lost User

              Chris Losinger wrote:

              2. from those 1000 bytes, generate an SHA1 hash.
              3. store the hashes, file size, and file names in a list

Why would you want to waste all those CPU cycles calculating a SHA-1 hash? If you know that the maximum number of bytes you're going to read is 1000, you could have used an inexpensive hashing algorithm; there are dozens of inexpensive hash algorithms with a negligible collision probability over 1000 bytes. Your application would probably get a huge performance boost. Not to mention that a SHA-1 hash produces 20 bytes of data to store; you could reduce your database size by 50% just by moving to a more efficient algorithm. Were you initially planning on hashing the entire file or something?

Best Wishes,
-David Delaune

#7 - Chris Losinger

It's actually an MD5 hash, but it can be switched to SHA1 or whatever at compile time. (I wrote the earlier post from memory and forgot the default.)

                Randor wrote:

                If you know that the maximum number of bytes you're going to read is 1000 then you could have used an inexpensive hashing algorithm.

Probably. But the goal was to avoid collisions while using a hash method that was built into the .NET Framework; this was a quick utility project. And really, calculating MD5 on 1 KB is definitely not the bottleneck here: the file scanning and the open/read/close calls totally dominate the process, especially when the files are on a network drive somewhere.

                Randor wrote:

                Were you initially planning on hashing the entire file or something?

Actually, yes (and that's what happens when the 1 KB scan finds matches). The partial-hash pre-scan was a later addition, and I just reused the hash code that was already written instead of coming up with a new method.

                image processing toolkits | batch image processing
