File compare - CRC?

    jbradshaw
    #1

    I have a system where I need to compare files to see if they are the same. Is doing something like CRC the fastest/best way? I could do a byte by byte comparison but that seems like it would be slower. I'm going to be keeping a list of files in a database so if I could come up with a number I generate just once, that would be great because I could store it in the database and then compare from then on (I'll be adding files to the list as time goes on so I need to compare them as it goes.) I'm writing in C# v2.0 (although if I have to I might be able to go to 3.5). Any thoughts would be appreciated. TIA - Jeff.

      Saksida Bojan
      #2

      Using a CRC is one possible solution. But be aware that two files with different content can sometimes produce the same CRC. CRC is the fastest solution.
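For reference, a minimal sketch of a table-driven CRC-32, assuming the common reflected polynomial 0xEDB88320 (the variant ZIP uses). The class name `Crc32` is made up for illustration and is not from this thread. It avoids anything newer than C# 2.0.

```csharp
using System;

// Hypothetical helper: table-driven CRC-32 (reflected polynomial 0xEDB88320).
public static class Crc32
{
    private static readonly uint[] Table = BuildTable();

    // Precompute the CRC of every possible byte value once.
    private static uint[] BuildTable()
    {
        uint[] table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // Returns the CRC-32 of the whole buffer.
    public static uint Compute(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }
}
```

A 32-bit result has only about 4.3 billion possible values, so two different files sharing a CRC is a realistic possibility once the file list gets large.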

      jbradshaw wrote:

      I could do a byte by byte comparison but that seems like it would be slower.

      Byte-by-byte is super slow. Compare blocks of bytes instead. For example: read 1024 bytes from the first file, read 1024 bytes from the second file, then compare them. Never read and compare one byte at a time; you would "kill the system". PS: The only shortcut is to compare the file lengths first.
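A rough sketch of that idea: check the lengths first, then read both files in blocks and stop at the first difference. `FileComparer`, `ContentEquals` and the 64 KB block size are assumptions chosen for illustration, not something from this thread.

```csharp
using System;
using System.IO;

// Hypothetical helper: compares two files by length, then block by block.
public static class FileComparer
{
    public static bool ContentEquals(string pathA, string pathB)
    {
        // Shortcut: files of different length cannot be identical.
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;

        const int BlockSize = 64 * 1024; // assumed buffer size; tune as needed
        byte[] bufA = new byte[BlockSize];
        byte[] bufB = new byte[BlockSize];

        using (FileStream a = File.OpenRead(pathA))
        using (FileStream b = File.OpenRead(pathB))
        {
            while (true)
            {
                int readA = ReadBlock(a, bufA);
                int readB = ReadBlock(b, bufB);
                if (readA != readB) return false; // one file ended early
                if (readA == 0) return true;      // both ended, no difference found

                for (int i = 0; i < readA; i++)
                    if (bufA[i] != bufB[i]) return false; // stop at first mismatch
            }
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the stream ends.
    private static int ReadBlock(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

The bytes are still compared one at a time in memory; the point is that the disk is read in large blocks rather than one byte per call.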

      modified on Wednesday, December 2, 2009 12:07 PM

        Lost User
        #3

        It depends. If the files can be edited by a malicious user, a CRC will not be good enough, since it's very easy to calculate collisions for one. SHA-2 would be OK, but there is still a chance that two different files will accidentally have the same hash. That follows from the fact that there are more possible files than there are hashes, since the hash has a small fixed length. The chance is very small for meaningful files, though. If it's extremely important that the files are actually the same (with zero chance of false positives), then there are no shortcuts and you'd have to compare them byte by byte. (Edit: of course you should still read the files block by block, as said above; it would be rather braindead to read just a single byte at a time in a loop.)
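As an illustration, a SHA-256 hash (one member of the SHA-2 family) can be computed straight from a file stream, so the whole file never has to sit in memory. This is only a sketch; `FileHasher` and `Sha256Hex` are made-up names.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper: hashes a file's content with SHA-256.
public static class FileHasher
{
    // Returns the hash as 64 uppercase hex characters.
    public static string Sha256Hex(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        using (SHA256 sha = SHA256.Create())
        {
            // ComputeHash(Stream) reads the stream to its end internally.
            byte[] digest = sha.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }
}
```

The resulting string can be stored next to the file's record and compared later without touching the file again.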

          Luc Pattyn
          #4

          You have to define what "the same" means to you. In one sense, if two files exist, they are always different: they may have identical creation date, modification date, length, and content, but if they have identical names they must reside in different folders or partitions. So be more specific. Once that is defined, you can check identity by comparing the attributes that are relevant to your definition. For content it is wise to calculate (and probably store) some kind of hash; there is an infinite number of definitions and algorithms. Windows Explorer itself holds one 32-bit CRC for file content; ZIP files hold another one. Hashes and CRCs will be identical when content is identical, and they are very likely to be different for different content; when that isn't good enough, you need to compare all the bytes. :)

            PIEBALDconsult
            #5

            A byte-by-byte compare will likely take less time than calculating two CRCs or hashes (especially if the compare returns false early), so the only gain is if you store the CRC or hash for later. But you need to be sure that the file hasn't changed since you calculated its CRC or hash.

            jbradshaw wrote:

            keeping a list of files in a database

            Are you storing the actual file content? Or just the path and other information? If you only store the path, I wouldn't trust that the file has not been changed (or even deleted), so I wouldn't bother storing the CRC or hash. Personally, I store the file content and a SHA1 hash. <Anecdote> My GenOmatic generates a file and I want to know if the new file matches the previous version. I considered storing a hash in the file and comparing, but quickly decided that it was too unreliable, so I went with a string compare. </Anecdote>

              jbradshaw
              #6

              I know the files won't change once I have calculated the CRC for them. All I really need is a unique number for the contents of each file so that later, if I have two files, I can compare those numbers instead of doing a byte-by-byte comparison.

              Here's what I'm trying to do in a nutshell. Somebody uploads a file to my website. I store the file for long-term storage (just go with it). Somebody uploads another file. If it is the same as any file I've already stored, I don't store it again; otherwise I store it. And so on for every later upload.

              So really I want to make sure I only have distinct files on the server. I don't have a problem with a file being uploaded multiple times; I'd just like it to be retained (stored) only once. I figured that with some kind of CRC check, I could compute the CRC when the file is uploaded and then check the DB for any other file with that CRC. If there is none, add the file to the DB with its CRC and go on. If the file already exists, don't bother saving it and delete it from the upload area. TIA - Jeff.

                PIEBALDconsult
                #7

                Makes sense to me; I'd store a hash and file length.
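A sketch of how the upload check might look with a hash plus file length as the lookup key. Everything here is an assumption for illustration: `DistinctFileStore` is a made-up class, the dictionary stands in for the database table, and SHA-256 is just one possible hash.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hypothetical in-memory stand-in for the database of stored files.
public class DistinctFileStore
{
    // Key: "<length>:<hex hash>". Value: the stored content (a real
    // system would more likely hold a path or a row id here).
    private readonly Dictionary<string, byte[]> stored =
        new Dictionary<string, byte[]>();

    public int Count { get { return stored.Count; } }

    // Returns true if the upload was new and has been stored,
    // false if a file with the same length and hash already exists.
    public bool StoreIfNew(byte[] upload)
    {
        string key = MakeKey(upload);
        if (stored.ContainsKey(key))
            return false; // duplicate: caller can discard the upload
        stored.Add(key, upload);
        return true;
    }

    private static string MakeKey(byte[] content)
    {
        using (SHA256 sha = SHA256.Create())
        {
            string hex = BitConverter.ToString(sha.ComputeHash(content)).Replace("-", "");
            return content.Length + ":" + hex;
        }
    }
}
```

If a false match would be unacceptable, a key hit could be followed by a full byte comparison against the stored file before the upload is discarded.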

                  jbradshaw
                  #8

                  Can anybody suggest or point me to a hash routine? TIA - Jeff.

                    PIEBALDconsult
                    #9

                    Here's what I use, the result is a forty-character string:

                        // Hashes a string (as UTF-16 bytes) with SHA-1; returns 40 hex characters.
                        public static string Hash(string subject)
                        {
                            return Hash(
                                System.Text.Encoding.Unicode.GetBytes(subject),
                                new System.Security.Cryptography.SHA1Managed());
                        }

                        // Hashes a byte array with the given algorithm; returns uppercase hex.
                        public static string Hash(
                            byte[] subject,
                            System.Security.Cryptography.HashAlgorithm provider)
                        {
                            // HashSize is in bits; each hex character encodes 4 bits.
                            System.Text.StringBuilder result =
                                new System.Text.StringBuilder(provider.HashSize / 4);

                            foreach (byte b in provider.ComputeHash(subject))
                            {
                                result.Append(b.ToString("X2"));
                            }

                            return result.ToString();
                        }
                    

                    You can choose another hash algorithm and you don't have to convert it to a string.
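For example, the raw hash bytes can be kept as they are (say, in a binary column) and compared directly. A minimal sketch, with `RawHash` and its methods invented for illustration; any `HashAlgorithm` can be passed in.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper: works with raw hash bytes instead of hex strings.
public static class RawHash
{
    // Returns the hash bytes produced by whichever algorithm is supplied.
    public static byte[] Compute(byte[] subject, HashAlgorithm provider)
    {
        return provider.ComputeHash(subject);
    }

    // Arrays compare by reference in C#, so compare element by element.
    public static bool SameHash(byte[] x, byte[] y)
    {
        if (x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (x[i] != y[i]) return false;
        return true;
    }
}
```

Calling `RawHash.Compute(data, SHA256.Create())` gives 32 bytes, while `SHA1.Create()` gives 20, so the column size depends on the algorithm chosen.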
