How to remove duplicate files rapidly?

happy liu wrote:

The current thinking is to compute a hash for every file and then delete the duplicates. We can use MD5 or SHA-1 to calculate the hash. From a speed point of view MD4 is a better choice, since it is faster (though less secure). But after experimenting I found that even MD4 is slow enough to cost a lot of time when dealing with large files. Is there any algorithm that works faster?
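For reference, a minimal sketch of the approach the question describes: hash every file in full and group files by digest. This is Python with hashlib; the function names and the SHA-1 default are illustrative choices, not anything from the thread. Its cost is dominated by reading every byte of every file, which is what the replies below try to avoid.

```python
import hashlib
import os

def full_file_hash(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a whole file in chunks so large files never have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Return groups of paths whose full-file digests match (likely duplicates)."""
    by_digest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest.setdefault(full_file_hash(path), []).append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```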

Manfred Rudolf Bihy wrote (#4):

You can considerably speed up the process. If you have mostly big files, use a two-step hash comparison: in the first step, take only, say, 1-5 KB from the beginning of each file to build the hash (I'd recommend SHA-2). Only compute a hash over the complete file if the prefix hashes of two files match. Best, Manfred

"I had the right to remain silent, but I didn't have the ability!"

Ron White, Comedian
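A sketch of the two-step comparison suggested in #4, assuming a 4 KB prefix (the exact size is a free choice) and reusing a full-file hash function such as the one sketched under the original question; the names here are illustrative.

```python
import hashlib

def prefix_hash(path, prefix_bytes=4096):
    """Hash only the first few KB of the file (SHA-256, per the SHA-2 suggestion)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(prefix_bytes)).hexdigest()

def two_step_duplicates(paths, full_hash):
    """Group by prefix hash first; confirm matches with a full-file hash."""
    by_prefix = {}
    for path in paths:
        by_prefix.setdefault(prefix_hash(path), []).append(path)

    duplicates = []
    for group in by_prefix.values():
        if len(group) < 2:
            continue  # unique prefix hash, so it cannot have a duplicate
        by_full = {}
        for path in group:
            by_full.setdefault(full_hash(path), []).append(path)
        duplicates.extend(g for g in by_full.values() if len(g) > 1)
    return duplicates
```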

Kornfeld Eliyahu Peter wrote (#5), in reply to happy liu:

It is almost certain that the problem is not your hash algorithm but I/O. All (or most) hash algorithms are fast enough that they are not to blame... If you have a large number of files you have to rethink your approach: 1. use the file system's metadata (FileInfo): name, size, creation and last-modified times, and so on; 2. if you can't, consider hashing only the first block (4 KB) of every file and go further only for those that come out the same.

I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is (V).

"It never ceases to amaze me that a spacecraft launched in 1977 can be fixed remotely from Earth." ― Brian Cox

happy liu wrote (#6), in reply to Kornfeld Eliyahu Peter:

Thank you, V. ^_^ Your advice is good. At the beginning I did not have a clear understanding of this question; now I agree that I/O is the key point, and I hope I can find a way to resolve my problem. Thanks!

happy liu wrote (#7), in reply to Manfred Rudolf Bihy:

Thanks for your advice. But I wonder why SHA-2 would be the better choice when it is slower than MD5 or MD4?

Manfred Rudolf Bihy wrote (#8), in reply to happy liu:

It was developed to be more robust, i.e. the probability of collisions is reduced. Just look it up; there are plenty of good explanations on the internet of what the actual improvements over its predecessors are. Cheers!

            "I had the right to remain silent, but I didn't have the ability!"

            Ron White, Comedian

            H 1 Reply Last reply
            0
            • M Manfred Rudolf Bihy

              It's been developed to be more robust, i.e. the probability to have collisions is reduced. Just look it up, there are plenty of good explanations on the internet on what the actual improvements over its predecessors are. Cheers!

              "I had the right to remain silent, but I didn't have the ability!"

              Ron White, Comedian

              H Offline
              H Offline
              happy liu
              wrote on last edited by
              #9

Thanks! ^_^

Kornfeld Eliyahu Peter wrote (#10), in reply to happy liu:

Yes, I/O is a key issue... If it is in your power, you may move the storage to a more capable one, like SCSI or flash; that should speed up I/O...

Chris Losinger wrote (#11), in reply to happy liu:

I wrote a little app to do this for my own use, and I used multiple tests. For every file to compare, find: 1. the size: if two files aren't the same size, they aren't the same file; 2. a copy of the first 100 bytes: if the first 100 bytes don't match, they aren't the same file. This is easy and fast to compare, requires no computation, and doesn't take much to store, so that stage is very fast; it generally takes far longer just to get the list of files than to run all of those tests. After that, calculate the hash (I used SHA-1) of the first few KB of each remaining file and compare the hashes. After those two stages you will have eliminated most of the non-duplicates. Then you can do a full hash of the remaining files to find the definite duplicates.

image processing toolkits | batch image processing
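A sketch of the staged filtering described in #11: group by size, then by the first 100 bytes, then by a hash of the first few kilobytes, and only full-hash whatever survives. The 100-byte check and SHA-1 mirror the post; the 8 KB prefix, the 1 MB read chunk and all names are illustrative.

```python
import hashlib
import os
from collections import defaultdict

def _refine(groups, key_func):
    """Split each candidate group by key_func, keeping only buckets of 2+ files."""
    refined = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key_func(path)].append(path)
        refined.extend(b for b in buckets.values() if len(b) > 1)
    return refined

def _head(path, n=100):
    with open(path, "rb") as f:
        return f.read(n)  # raw bytes: no computation, cheap to compare

def _partial_sha1(path, n=8192):
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(n)).hexdigest()

def _full_sha1(path, chunk=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def staged_duplicates(root):
    """Return groups of definite duplicates under root, cheapest checks first."""
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)
    candidates = [g for g in by_size.values() if len(g) > 1]

    candidates = _refine(candidates, _head)          # first 100 bytes
    candidates = _refine(candidates, _partial_sha1)  # hash of the first few KB
    return _refine(candidates, _full_sha1)           # full hash: definite duplicates
```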

Kornfeld Eliyahu Peter wrote (#12), in reply to Chris Losinger:

You might give that code a polish and publish it here as a tip or article...

Manoj Kumar Rai wrote (#13), in reply to Chris Losinger:

I agree; one can use multiple kinds of checks depending on what is suitable: 1. check the size: if the sizes do not match, the files are not the same; 2. compare 2% of the bytes from the beginning and then from the end: if those do not match, the files are not the same; 3. finally, one option is to simply compare the remaining parts of both files, or use a CRC, checksum, or an MD/SHA hash. If the MD/SHA hashes are not already calculated and stored, they will be a bit costly.

Manoj Never Gives up
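A small sketch of the head-and-tail check from #13: read roughly 2% of each file from the start and from the end (capped here at 64 KB, an arbitrary limit) and compare those bytes before doing any hashing. Both function names are made up for the example.

```python
import os

def head_and_tail(path, fraction=0.02, cap=1 << 16):
    """Return (first_slice, last_slice), each about `fraction` of the file, capped."""
    size = os.path.getsize(path)
    n = min(max(int(size * fraction), 1), cap)
    with open(path, "rb") as f:
        head = f.read(n)
        f.seek(max(size - n, 0))
        tail = f.read(n)
    return head, tail

def same_edges(path_a, path_b):
    """Cheap pre-filter: files whose edges differ cannot be duplicates."""
    return head_and_tail(path_a) == head_and_tail(path_b)
```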

theafien wrote (#14), in reply to happy liu:

Store each file's byte size and a simple CRC code. Loop over the files and check the byte size first; only if the sizes are equal, perform the CRC or hash check. Or do the hash check only when the sizes are equal, and store the hash for future use.
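A sketch of the cached-checksum idea in #14, using zlib's CRC-32: remember (size, CRC) per path so a repeated scan only recomputes when the size has changed. The in-memory dict is a stand-in for whatever persistent store one would actually use.

```python
import os
import zlib

_cache = {}  # path -> (size, crc32); a stand-in for a persisted cache

def cached_crc32(path, chunk=1 << 20):
    """Return the file's CRC-32, reusing the cached value while the size is unchanged."""
    size = os.path.getsize(path)
    cached = _cache.get(path)
    if cached and cached[0] == size:
        return cached[1]
    crc = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            crc = zlib.crc32(block, crc)
    _cache[path] = (size, crc)
    return crc
```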

saephoed wrote (#15), in reply to Manfred Rudolf Bihy:

I found a similar solution when I had to recover 4.5 TB of files, mostly between 0.5 and 15 GB, where around 1-2% of them had been terminated early during a copy process but were left with the correct file size, padded out with zeroes. It adds to your suggestion: scan all big files with big fixed jumps, and some random jumps, up to their ends.
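A sketch of the sampling idea in #15: hash a handful of fixed offsets plus a few random ones spread across a large file, always including the tail, so a zero-filled tail from an interrupted copy changes the digest even though the size matches. Block size, sample counts and the fixed seed are arbitrary choices for the example.

```python
import hashlib
import os
import random

def sampled_hash(path, block=4096, fixed_points=8, random_points=4, seed=0):
    """Digest of sampled blocks; same-sized files are sampled at the same offsets."""
    size = os.path.getsize(path)
    rng = random.Random(seed)  # fixed seed so the sample positions are reproducible
    offsets = [i * size // fixed_points for i in range(fixed_points)]
    offsets += [rng.randrange(max(size - block, 1)) for _ in range(random_points)]
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for off in sorted(set(offsets)):
            f.seek(off)
            h.update(f.read(block))
        f.seek(max(size - block, 0))
        h.update(f.read(block))  # always cover the very end of the file
    return h.hexdigest()
```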
