File compare - CRC?

jbradshaw

I have a system where I need to compare files to see if they are the same. Is doing something like CRC the fastest/best way? I could do a byte by byte comparison but that seems like it would be slower. I'm going to be keeping a list of files in a database so if I could come up with a number I generate just once, that would be great because I could store it in the database and then compare from then on (I'll be adding files to the list as time goes on so I need to compare them as it goes.) I'm writing in C# v2.0 (although if I have to I might be able to go to 3.5). Any thoughts would be appreciated. TIA - Jeff.

Saksida Bojan

Using a CRC is one possible solution, but be aware that two files with different content can occasionally have the same CRC. CRC is the fastest to compute.

jbradshaw wrote:

I could do a byte by byte comparison but that seems like it would be slower.

Reading and comparing one byte at a time is very slow. Compare blocks of bytes instead. For example: read 1024 bytes from the first file, read 1024 bytes from the second file, then compare the two blocks. Never do single-byte reads, you would "kill the system". PS: The only shortcut is to compare the file lengths first.
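A minimal sketch of the block-wise compare described above, sticking to C# 2.0 constructs. The names `FileComparer`, `ContentEquals` and `FillBuffer` and the 64 KB block size are assumptions made for illustration, not anything from the thread.

```csharp
using System;
using System.IO;

public static class FileComparer
{
    // Hypothetical helper: true when both files have identical content.
    public static bool ContentEquals(string pathA, string pathB)
    {
        // Shortcut: files of different length cannot be equal.
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;

        using (FileStream a = File.OpenRead(pathA))
        using (FileStream b = File.OpenRead(pathB))
        {
            return ContentEquals(a, b);
        }
    }

    // Compares two streams block by block and stops at the first difference.
    public static bool ContentEquals(Stream a, Stream b)
    {
        const int BlockSize = 64 * 1024; // assumed block size, tune as needed
        byte[] bufA = new byte[BlockSize];
        byte[] bufB = new byte[BlockSize];

        while (true)
        {
            int readA = FillBuffer(a, bufA);
            int readB = FillBuffer(b, bufB);

            if (readA != readB) return false; // one stream ended early
            if (readA == 0) return true;      // both ended with no difference

            for (int i = 0; i < readA; i++)
            {
                if (bufA[i] != bufB[i]) return false;
            }
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the stream is exhausted.
    private static int FillBuffer(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

The inner loop still looks at individual bytes, but only in memory. The expensive part, reading from disk, happens one block at a time.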

modified on Wednesday, December 2, 2009 12:07 PM

Lost User

It depends. If the files can be edited by a malicious user, then a CRC will not be good enough, since it is very easy to construct collisions for it. SHA-2 would be OK, but there is still a chance that two different files will accidentally have the same hash (this follows from the fact that there are more possible files than there are hashes, since the hash has a small fixed length). That chance is rather small for meaningful files, though. If it is critical that the files are actually the same (with zero chance of false positives), then there are no shortcuts and you have to compare them byte by byte. (Edit: of course you should read the files block by block, as said above; it would be rather braindead to read just a single byte many times in a loop.)

Luc Pattyn

You have to define when files are the same to you. In one way, if two files exist, they are always different: they may have identical creation date, modification date, length, and content, but if they also have identical names, they must reside in different folders or partitions. So be more specific. Once you have a definition, you can check identity by checking the attributes that are relevant to it. For content it is wise to calculate (and probably store) some kind of hash; there is an infinite number of definitions and algorithms. Windows Explorer itself holds one 32-bit CRC for file content; ZIP files hold another one. Hashes and CRCs will be identical when content is identical, and they are very likely to be different for different content; when that isn't good enough, you need to compare all the bytes. :)
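.NET 2.0 has no built-in CRC-32 class, so here is a rough sketch of the common table-driven variant (reflected polynomial 0xEDB88320, the one ZIP uses). The class and method names are made up for this example. `Update` is written so it can be fed a file block by block: start with 0 and pass the previous result back in.

```csharp
using System;

public static class Crc32
{
    // Lookup table: one entry per possible byte value.
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        uint[] table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int bit = 0; bit < 8; bit++)
            {
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            }
            table[i] = c;
        }
        return table;
    }

    // Hypothetical helper: folds count bytes of buffer into a running CRC.
    // Pass 0 for the first block, then the previous return value.
    public static uint Update(uint crc, byte[] buffer, int offset, int count)
    {
        crc ^= 0xFFFFFFFFu; // undo the final inversion of the previous call
        for (int i = offset; i < offset + count; i++)
        {
            crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
        }
        return crc ^ 0xFFFFFFFFu;
    }
}
```

For the ASCII bytes of "123456789" this gives 0xCBF43926, the usual check value for this CRC. With only 32 bits, accidental matches between different files are far more likely than with a cryptographic hash.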


PIEBALDconsult

A byte-by-byte compare will likely take less time than calculating two CRCs or hashes (especially if the compare returns false early), so the only gain is if you store the CRC or hash for later. But you need to be sure that the file hasn't changed since you calculated its CRC or hash.

jbradshaw wrote:

keeping a list of files in a database

Are you storing the actual file content? Or just the path and other information? If you only store the path, I wouldn't trust that the file has not been changed (or even deleted), so I wouldn't bother storing the CRC or hash. Personally, I store the file content and a SHA1 hash. <Anecdote> My GenOmatic generates a file and I want to know if the new file matches the previous version. I considered storing a hash in the file and comparing, but quickly decided that it was too unreliable, so I went with a string compare. </Anecdote>

jbradshaw

I know the files won't change once I have calculated the CRC for them. All I really need is a unique number for the contents of the file, so that later, if I have two files, I can compare those numbers instead of having to do a byte-by-byte comparison.

Here's what I'm trying to do in a nutshell. Somebody uploads a file to my website. I store the file for long-term storage (just go with it). Somebody uploads another file to my website. If the file is the same as any of the files I've already stored, don't store it again; otherwise store the file. Somebody uploads another file to my website. If the file is the same as any of the files.... So really I want to make sure I only have distinct files on the server. I don't have a problem with a file being uploaded multiple times, I'd just like it to be retained (stored) only once.

I figured that by doing some kind of CRC check, I could get the CRC when the file is uploaded and then check the DB for any other files with that CRC. If there are none, add the file to the DB with the CRC and go on. If the file already exists, don't bother saving it and delete the file from the upload area. TIA - Jeff.

PIEBALDconsult

Makes sense to me; I'd store a hash and file length.
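Roughly what I mean, as a sketch only: an in-memory `Dictionary` stands in for the database table, and `UploadIndex`, `MakeKey`, `TryAdd` and `Find` are made-up names. The hash is assumed to arrive as a hex string computed elsewhere.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the database table of stored files.
public class UploadIndex
{
    // Key: "<length>:<hash>", value: where the content was stored.
    private readonly Dictionary<string, string> stored =
        new Dictionary<string, string>();

    // Combining length and hash makes an accidental match even less likely.
    public static string MakeKey(string hash, long length)
    {
        return length.ToString() + ":" + hash;
    }

    // Returns true when the upload is new and was recorded,
    // false when an entry with the same hash and length already exists.
    public bool TryAdd(string hash, long length, string storagePath)
    {
        string key = MakeKey(hash, length);
        if (stored.ContainsKey(key)) return false;
        stored.Add(key, storagePath);
        return true;
    }

    // Returns the stored path for a hash and length, or null when unknown.
    public string Find(string hash, long length)
    {
        string path;
        return stored.TryGetValue(MakeKey(hash, length), out path) ? path : null;
    }
}
```

In a real database the same idea would be a unique index on the hash and length columns. If a false match must never happen, a hit can still be confirmed with a full content compare against the stored file before the upload is discarded.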

jbradshaw

Can anybody suggest or point me to a hash routine? TIA - Jeff.

PIEBALDconsult

Here's what I use, the result is a forty-character string:

    public static string
    Hash
    (
        string Subject
    )
    {
        return ( Hash 
        ( 
            System.Text.Encoding.Unicode.GetBytes ( Subject ) 
        , 
            new System.Security.Cryptography.SHA1Managed() 
        ) ) ;
    }

    public static string
    Hash
    (
        byte[]                                     Subject
    ,
        System.Security.Cryptography.HashAlgorithm Provider
    )
    {
        // HashSize is in bits and each hex character covers four bits.
        System.Text.StringBuilder result = 
            new System.Text.StringBuilder ( Provider.HashSize / 4 ) ;

        foreach 
        ( 
            byte b 
        in 
            Provider.ComputeHash ( Subject ) 
        )
        {
            result.Append ( b.ToString ( "X2" ) ) ;
        }

        return ( result.ToString() ) ;
    }

You can choose another hash algorithm and you don't have to convert it to a string.
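For uploaded files you would probably hash a stream rather than a string or a byte array, so the whole file does not have to sit in memory. A hedged sketch using the `ComputeHash(Stream)` overload of `HashAlgorithm`; `FileHasher`, `HashStream` and `HashFile` are names invented for this example, and SHA-256 is just one possible choice of algorithm.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class FileHasher
{
    // Hashes everything from the stream's current position to its end
    // and returns the result as upper-case hex.
    public static string HashStream(Stream input, HashAlgorithm provider)
    {
        byte[] hash = provider.ComputeHash(input);

        // Two hex characters per hash byte.
        StringBuilder result = new StringBuilder(hash.Length * 2);
        foreach (byte b in hash)
        {
            result.Append(b.ToString("X2"));
        }
        return result.ToString();
    }

    // Hypothetical convenience wrapper: SHA-256 of a file on disk.
    public static string HashFile(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        using (SHA256 sha = SHA256.Create())
        {
            return HashStream(stream, sha);
        }
    }
}
```

With SHA-256 the result is a 64-character string. Passing a `SHA1Managed` instance instead gives the forty characters mentioned above.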