Skip to content
  • Categories
  • Recent
  • Tags
  • Popular
  • World
  • Users
  • Groups
Skins
  • Light
  • Cerulean
  • Cosmo
  • Flatly
  • Journal
  • Litera
  • Lumen
  • Lux
  • Materia
  • Minty
  • Morph
  • Pulse
  • Sandstone
  • Simplex
  • Sketchy
  • Spacelab
  • United
  • Yeti
  • Zephyr
  • Dark
  • Cyborg
  • Darkly
  • Quartz
  • Slate
  • Solar
  • Superhero
  • Vapor

  • Default (No Skin)
  • No Skin
Collapse
Code Project
  1. Home
  2. General Programming
  3. C#
  4. Counting Different Strings

Counting Different Strings

Scheduled Pinned Locked Moved C#
performancehelp
7 Posts 2 Posters 0 Views 1 Watching
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • C Offline
    C Offline
    clint1982
    wrote on last edited by
    #1

    Hey I have a problem, I have a lot of files and each file contains a lot of strings with length of 4. There might be double strings in each file. I need a fast way to implement a check of how many different strings I have in all the files together. I can't load all the strings into memory, there are many strings. The number of different strings can arrive easily to 100 millions. Clint

    E 1 Reply Last reply
    0
    • C clint1982

      Hey I have a problem, I have a lot of files and each file contains a lot of strings with length of 4. There might be double strings in each file. I need a fast way to implement a check of how many different strings I have in all the files together. I can't load all the strings into memory, there are many strings. The number of different strings can arrive easily to 100 millions. Clint

      E Offline
      E Offline
      Ennis Ray Lynch Jr
      wrote on last edited by
      #2

      If you need a distinct count then you have to loop! You don't need to load the entire file into memory, you can use a stream to quickly scan the file. 100 million seems like a large estimate for the number of distinct 4 character strings. Are you really looking for an alphabet of 255 characters? (or worse if unicode)? If you really will have this many distinct records create a file of the number of distinct possible records * 4 bytes and store the count as an unsigned int in the file offset.

      A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane

      C 1 Reply Last reply
      0
      • E Ennis Ray Lynch Jr

        If you need a distinct count then you have to loop! You don't need to load the entire file into memory, you can use a stream to quickly scan the file. 100 million seems like a large estimate for the number of distinct 4 character strings. Are you really looking for an alphabet of 255 characters? (or worse if unicode)? If you really will have this many distinct records create a file of the number of distinct possible records * 4 bytes and store the count as an unsigned int in the file offset.

        A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane

        C Offline
        C Offline
        clint1982
        wrote on last edited by
        #3

        That's even a small estimate. I will have 2 on the power of 32 records maximum. I don't know if it is a good idea to build such a big file.

        E 1 Reply Last reply
        0
        • C clint1982

          That's even a small estimate. I will have 2 on the power of 32 records maximum. I don't know if it is a good idea to build such a big file.

          E Offline
          E Offline
          Ennis Ray Lynch Jr
          wrote on last edited by
          #4

          There is a difference between distinct records and records. If you want to count distinct records you must have a way of storing the count. 2^32 is roughly 255^4 which makes me wonder why it is even string data (they would more accurately be binary at this point) "Got Ram?"

          clint1982 wrote:

          I don't know if it is a good idea to build such a big file

          A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane

          C 1 Reply Last reply
          0
          • E Ennis Ray Lynch Jr

            There is a difference between distinct records and records. If you want to count distinct records you must have a way of storing the count. 2^32 is roughly 255^4 which makes me wonder why it is even string data (they would more accurately be binary at this point) "Got Ram?"

            clint1982 wrote:

            I don't know if it is a good idea to build such a big file

            A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane

            C Offline
            C Offline
            clint1982
            wrote on last edited by
            #5

            You are right, It's 4 character binary. I need an efficient way to count the number of distinct records,

            E 1 Reply Last reply
            0
            • C clint1982

              You are right, It's 4 character binary. I need an efficient way to count the number of distinct records,

              E Offline
              E Offline
              Ennis Ray Lynch Jr
              wrote on last edited by
              #6

              Index the files then loop through to count. However, 4^255 is huge and you will not be able to store the results in ram so you will need a file structure just to store the results. Better if you could use small subsets at a time.

              A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane

              E 1 Reply Last reply
              0
              • E Ennis Ray Lynch Jr

                Index the files then loop through to count. However, 4^255 is huge and you will not be able to store the results in ram so you will need a file structure just to store the results. Better if you could use small subsets at a time.

                A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane

                E Offline
                E Offline
                Ennis Ray Lynch Jr
                wrote on last edited by
                #7

                Build a self building huffman tree to get the counts. It would be slightly slower than an incremental loop but would occupy 50% less space unless the counts are uniform.

                A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane

                1 Reply Last reply
                0
                Reply
                • Reply as topic
                Log in to reply
                • Oldest to Newest
                • Newest to Oldest
                • Most Votes


                • Login

                • Don't have an account? Register

                • Login or register to search.
                • First post
                  Last post
                0
                • Categories
                • Recent
                • Tags
                • Popular
                • World
                • Users
                • Groups