How to search for a specific string in a file?

SimpleData wrote (#1):

Hi. Say I have a file and want to find the location of a specific string inside it. The obvious approach is to load the file into a byte array and then parse it into a string or a StringBuilder byte by byte, right? That is easy when the file is small. When the file is huge (say 400 megabytes), it is not: the application demands a lot of RAM, which is annoying. I thought about reading the file into the byte array 100 bytes at a time. But that brings another problem: if the first half of my string is in one 100-byte block and the rest is in the next, my application cannot recognize the string and therefore cannot report its location. How can I solve this? Any ideas? If anything is unclear, feel free to ask. Thanks.

StarBP wrote (#2, replying to #1):

Test it twice. First test it as usual, then test it again with the divisions offset by exactly half the batch size. As long as each batch is at least twice the size of the search string, you are guaranteed to find the string in at least one of the two searches. More specifically, the offset MUST be less than or equal to half the batch size AND greater than or equal to the size of the string being searched for.

SimpleData wrote (#3, replying to #2):

I wasn't quite able to understand what you mean. I got the main idea, but I still have a problem with the ratios. My string is 17 characters long, so what should my search lengths be?

StarBP wrote (#4, replying to #3):

100 bytes is just fine. Search once with a stride of 100 bytes starting at zero-indexed byte 0, then search again with a stride of 100 bytes starting at zero-indexed byte 50. This works for strings up to 50 bytes in length.
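
For readers who want to see the two-pass idea in code, here is a minimal sketch, not StarBP's actual implementation. It assumes a seekable stream and a search string already encoded as bytes; the names `TwoPassSearch`, `ScanWindows` and `Find` are made up for illustration. Note that it reads the file twice, once per pass.

```csharp
using System;
using System.IO;

public static class TwoPassSearch
{
    // Naive search for pattern inside the first 'length' bytes of buffer.
    static int IndexOf(byte[] buffer, int length, byte[] pattern)
    {
        for (int i = 0; i + pattern.Length <= length; i++)
        {
            int j = 0;
            while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Scans consecutive windows of batchSize bytes, the first one starting at
    // startOffset. Only matches lying entirely inside one window are seen.
    static long ScanWindows(Stream stream, byte[] pattern, int batchSize, long startOffset)
    {
        var buffer = new byte[batchSize];
        stream.Seek(startOffset, SeekOrigin.Begin);
        long windowStart = startOffset;
        while (true)
        {
            int filled = 0, n;
            // Read may return fewer bytes than requested, so fill the window in a loop.
            while (filled < batchSize && (n = stream.Read(buffer, filled, batchSize - filled)) > 0)
                filled += n;
            if (filled == 0) return -1;                  // end of stream
            int hit = IndexOf(buffer, filled, pattern);
            if (hit >= 0) return windowStart + hit;
            if (filled < batchSize) return -1;           // short final window
            windowStart += filled;
        }
    }

    // Pass 1 uses windows starting at 0, pass 2 windows starting at batchSize / 2.
    // Returns the byte offset of the first occurrence, or -1.
    public static long Find(Stream stream, byte[] pattern, int batchSize)
    {
        if (pattern.Length == 0 || pattern.Length * 2 > batchSize)
            throw new ArgumentException("batchSize must be at least twice the pattern length.");
        long first = ScanWindows(stream, pattern, batchSize, 0);
        long second = ScanWindows(stream, pattern, batchSize, batchSize / 2);
        if (first < 0) return second;
        if (second < 0) return first;
        return Math.Min(first, second);
    }
}
```

With a 17-byte search string and a batch size of 100, a match that straddles byte 100 is missed by the first pass but falls entirely inside the second pass's window covering bytes 50 to 149.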

Lost User wrote (#5, replying to #1):

So what you want to do is find only one string, once? Because then you could just stream the file, no buffers needed. [almost]

To visualize this (in case anyone needs the explanation): you can look at it as a state machine with a state for every character in the string you are searching for (let's call it X). The state machine keeps reading one character at a time until it exits. It starts in state 0 and changes to state 1 if it reads a character that matches X[0]. State n transitions to state n+1 if it reads X[n], or to state 0 if it reads anything else. The last state returns the position in the file if it reads X[last]. Every state returns "not found" if it runs out of characters to read. [/but not quite]

Solution: use the Knuth-Morris-Pratt algorithm. At worst it has to evaluate the same character multiple times, but it never needs to go back in the file (only forward, and skipping characters in a stream is trivial). Essentially it does the same as what I described, but it jumps back to the right state instead of always to state 0. That way you only need a constant amount of memory plus the string X.

modified on Sunday, March 7, 2010 9:38 AM
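
A rough sketch of what the Knuth-Morris-Pratt suggestion could look like over a stream, offered as one possible reading of the post rather than its author's code. The names `StreamSearch`, `BuildFailureTable` and `IndexOf` are hypothetical, and the search string is assumed to be already encoded as bytes in the file's encoding. Apart from the pattern, the only extra memory is a failure table with one int per pattern byte.

```csharp
using System;
using System.IO;

public static class StreamSearch
{
    // fail[i] is the length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    public static int[] BuildFailureTable(byte[] pattern)
    {
        var fail = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k])
                k = fail[k - 1];
            if (pattern[i] == pattern[k])
                k++;
            fail[i] = k;
        }
        return fail;
    }

    // Returns the offset of the first match, counted in bytes from the
    // stream's position at the time of the call, or -1 if there is none.
    public static long IndexOf(Stream stream, byte[] pattern)
    {
        if (pattern.Length == 0) return 0;
        int[] fail = BuildFailureTable(pattern);
        int state = 0;      // number of pattern bytes matched so far
        long consumed = 0;  // number of bytes read from the stream
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            consumed++;
            // On a mismatch, fall back to the longest prefix that still
            // matches instead of always returning to state 0.
            while (state > 0 && b != pattern[state])
                state = fail[state - 1];
            if (b == pattern[state])
                state++;
            if (state == pattern.Length)
                return consumed - pattern.Length;
        }
        return -1;
    }
}
```

Each byte is read from the stream exactly once, so this also works on streams that cannot seek. `FileStream` buffers reads internally, which keeps the per-byte `ReadByte` calls reasonably cheap.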

SimpleData wrote (#6, replying to #5):

It is a great idea. How come I never thought of it? Thanks. :)

SimpleData wrote (#7, replying to #4):

Thanks.

Lost User wrote (#8, replying to #6):

You're welcome :)

Luc Pattyn wrote (#9, replying to #5):

You do know it gets a little bit more complex when the characters in the search string aren't all different, as in: find "anas" in "a long text containing ananas and other stuff"; returning to state 0 isn't always right. :)

                    Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles]


                    I only read code that is properly formatted, adding PRE tags is the easiest way to obtain that.


Lost User wrote (#10, replying to #9):

OK, fine, spoil the fun :) So maybe it isn't 0 but more like X.LastIndexOf(the character that was read) (or 0 if it wasn't found). Or, if it isn't that, I might actually have to think, and it's the weekend, so no thanks (exercise for the reader?).

My brain got unlazy for a second and remembered the solution; see the edit.

modified on Saturday, March 6, 2010 9:56 PM

Shane5555 wrote (#11, replying to #1):

Similar to StarBP: use two rolling buffers, each the size of the search term.

Initialize by loading data into buffer 1, then repeat the following steps:
- clear buffer 2
- dump buffer 1 into buffer 2
- load fresh data into buffer 1
- combine the buffers
- search the combination

If the buffers are the correct size, your search term will always be in one combination. The overhead is that you will be searching each buffer twice, though.

Hope it helps,
Shane
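
The rolling-buffer steps above can be sketched roughly as follows. This is a variation rather than a literal transcription: instead of two equal-sized buffers it keeps only the last (pattern length - 1) bytes of the previous chunk in front of each fresh chunk, which is enough for a straddling match to show up whole. `OverlapSearch`, `Find` and the `chunkSize` parameter are hypothetical names, and the search string is assumed to be already encoded as bytes.

```csharp
using System;
using System.IO;

public static class OverlapSearch
{
    // Single forward pass. Returns the byte offset of the first match,
    // counted from the stream's starting position, or -1.
    public static long Find(Stream stream, byte[] pattern, int chunkSize = 4096)
    {
        if (pattern.Length == 0) return 0;
        int overlap = pattern.Length - 1;
        var buffer = new byte[overlap + chunkSize];
        int carried = 0;       // bytes kept from the previous iteration
        long bufferStart = 0;  // stream offset corresponding to buffer[0]
        int read;
        while ((read = stream.Read(buffer, carried, chunkSize)) > 0)
        {
            int length = carried + read;
            // Naive search over the carried tail plus the fresh data.
            for (int i = 0; i + pattern.Length <= length; i++)
            {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length) return bufferStart + i;
            }
            // Move the tail to the front so a match that straddles the
            // boundary is seen in full on the next iteration.
            int keep = Math.Min(overlap, length);
            Array.Copy(buffer, length - keep, buffer, 0, keep);
            bufferStart += length - keep;
            carried = keep;
        }
        return -1;
    }
}
```

Only the carried tail is searched twice, so the overhead stays small when the chunk is much larger than the search term, and memory use is bounded by the chunk size plus the pattern length.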

SimpleData wrote (#12, replying to #9):

Yes, but I think that is not a problem for me. I am looking for a string in the file, no matter where it is. I think code can express everything better. Here is my code:

private long DigBinary(string file, string strToDig)
{
    FileStream fs = null;

    char[] chAim = strToDig.ToCharArray();
    char chTemp = '0';
    long latestHitBeginningLocation = 0;
    int locationInArray = 0;

    try { fs = new FileStream(file, FileMode.Open, FileAccess.Read); }
    catch { throw new Exception("An error occurred while creating the stream."); }

    try
    {
        while (locationInArray < chAim.Length)
        {
            chTemp = (char)fs.ReadByte();

            if (chTemp != chAim[locationInArray])
                locationInArray = 0;

            if (chTemp == chAim[locationInArray])
            {
                if (locationInArray == 0)
                    latestHitBeginningLocation = fs.Position - 1;

                if (locationInArray == chAim.Length)
                    break;

                locationInArray++;
            }
            else
            {
                locationInArray = 0;
                latestHitBeginningLocation = 0;
            }
        }
    }
    catch { throw new Exception("An error occurred while reading the file."); }
    finally { if (fs != null) { fs.Close(); fs.Dispose(); } }

    return latestHitBeginningLocation;
}
                          

                          And yes, I know that my try-catch is useless. :D

Luc Pattyn wrote (#13, replying to #12):

That is a horrible piece of "code". X| X| X|

SimpleData wrote (#14, replying to #13):

I am open to suggestions.

Luc Pattyn wrote (#15, replying to #14):

Here are some:
- unspecified catch = deadly sin
- store the actual exception as the inner exception in a functional exception
- user-generated exceptions should inherit from ApplicationException
- two try blocks where one would suffice
- redundant chTemp initialization
- should use a using statement
- char[] chAim = strToDig.ToCharArray(); is redundant; use strToDig[index]

And the algorithm is wrong, as I reported earlier. :)
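
A small sketch of how the structural points in this list might be applied: one try block, specific catch clauses, the original exception kept as the inner exception, and a using statement. `FileSearchException`, `FileSearch.SearchFile` and the `search` delegate are hypothetical names, not from the thread. The sketch derives from `Exception` rather than `ApplicationException`, because current .NET guidance recommends the former; swap the base class if you prefer to follow the list literally.

```csharp
using System;
using System.IO;

// Hypothetical exception type, named only for this illustration.
public class FileSearchException : Exception
{
    public FileSearchException(string message, Exception inner)
        : base(message, inner) { }
}

public static class FileSearch
{
    // Opens the file, runs the supplied search over the stream and returns
    // its result. The using statement closes the stream on every path.
    public static long SearchFile(string path, Func<Stream, long> search)
    {
        try
        {
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            {
                return search(fs);
            }
        }
        catch (IOException ex)
        {
            // Covers missing files and read errors; the cause is preserved.
            throw new FileSearchException("Could not search '" + path + "'.", ex);
        }
        catch (UnauthorizedAccessException ex)
        {
            throw new FileSearchException("Access to '" + path + "' was denied.", ex);
        }
    }
}
```

Passing the search routine in as a delegate keeps the file handling separate from the matching algorithm, so either part can be changed without touching the other.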

SimpleData wrote (#16, replying to #15):

Thanks for the advice. I will change the code accordingly. This algorithm covers my needs. It works, it is fast and it doesn't consume much RAM. :)
