Counting Different Strings
-
Hey I have a problem, I have a lot of files and each file contains a lot of strings with length of 4. There might be double strings in each file. I need a fast way to implement a check of how many different strings I have in all the files together. I can't load all the strings into memory, there are many strings. The number of different strings can arrive easily to 100 millions. Clint
-
Hey I have a problem, I have a lot of files and each file contains a lot of strings with length of 4. There might be double strings in each file. I need a fast way to implement a check of how many different strings I have in all the files together. I can't load all the strings into memory, there are many strings. The number of different strings can arrive easily to 100 millions. Clint
If you need a distinct count then you have to loop! You don't need to load the entire file into memory, you can use a stream to quickly scan the file. 100 million seems like a large estimate for the number of distinct 4 character strings. Are you really looking for an alphabet of 255 characters? (or worse if unicode)? If you really will have this many distinct records create a file of the number of distinct possible records * 4 bytes and store the count as an unsigned int in the file offset.
A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane
-
If you need a distinct count then you have to loop! You don't need to load the entire file into memory, you can use a stream to quickly scan the file. 100 million seems like a large estimate for the number of distinct 4 character strings. Are you really looking for an alphabet of 255 characters? (or worse if unicode)? If you really will have this many distinct records create a file of the number of distinct possible records * 4 bytes and store the count as an unsigned int in the file offset.
A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane
-
That's even a small estimate. I will have 2 on the power of 32 records maximum. I don't know if it is a good idea to build such a big file.
There is a difference between distinct records and records. If you want to count distinct records you must have a way of storing the count. 2^32 is roughly 255^4 which makes me wonder why it is even string data (they would more accurately be binary at this point) "Got Ram?"
clint1982 wrote:
I don't know if it is a good idea to build such a big file
A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane
-
There is a difference between distinct records and records. If you want to count distinct records you must have a way of storing the count. 2^32 is roughly 255^4 which makes me wonder why it is even string data (they would more accurately be binary at this point) "Got Ram?"
clint1982 wrote:
I don't know if it is a good idea to build such a big file
A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane
-
You are right, It's 4 character binary. I need an efficient way to count the number of distinct records,
Index the files then loop through to count. However, 4^255 is huge and you will not be able to store the results in ram so you will need a file structure just to store the results. Better if you could use small subsets at a time.
A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane
-
Index the files then loop through to count. However, 4^255 is huge and you will not be able to store the results in ram so you will need a file structure just to store the results. Better if you could use small subsets at a time.
A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane
Build a self building huffman tree to get the counts. It would be slightly slower than an incremental loop but would occupy 50% less space unless the counts are uniform.
A man said to the universe: "Sir I exist!" "However," replied the Universe, "The fact has not created in me A sense of obligation." -- Stephen Crane