
How to speed up copy of .txt files into arrays?

  • A Arris74

    Hello,

    I use a text file of about 4 million lines with the following structure. The separator is '\t' (tab).

    "Date"    "Time"  "Signal"
    20000103  1658    351
    20000103  1659    352
    20000103  1700    350
    20000103  1701    352
    20000103  1702    355
    20000104  0900    354
    20000104  0901    352
    20000104  0902    350

    I would like to copy the Date, Time and Signal columns into STL vector containers, but I do not want to copy the whole file. For example, I might just need to start at line 250000 and stop at line 3000000. The criterion that defines where to start and stop copying is the Date and Time columns.

    I currently use CFile to copy the whole file into a buffer, and then I close the file. This is very fast, less than 1 second. Then I use strtok to read the buffer string by string, counting lines until I find the date and time that I want. Unfortunately, this method takes 57 seconds to go through the whole file just to pick up the start and end line numbers.

    So I am wondering whether it is possible to read the buffer or the file column by column. Does anyone know of a method to read a string buffer or a file quickly? Thank you very much for any advice that helps speed up this process.

    For information, I tested CString::Tokenize, but it is very slow. I also tested CStdioFile with ReadString, but that is also very slow.

    led mike
    wrote on last edited by
    #2

    Based on the sample data you posted, the file format is "fixed length", so simple math will give you the location of the exact record you want to access.

    led mike

      Arris74
      wrote on last edited by
      #3

      Hi. Regarding "simple math will give you the location of the exact record": I am not sure I understand your solution. Could you please clarify?

        David Crow
        wrote on last edited by
        #4

        Arris7 wrote:

        But I do not want to copy all the file. For example a just need to start at line 250000 and stop at line 3000000.

        Use the CFile::Seek() method to go to offset 250000*length_of_line. Without testing, this is 4500000 (assuming 18 bytes per line).
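A minimal sketch of that offset arithmetic, written against a standard `std::istream` instead of `CFile` so that it stays self-contained. The 18-byte record length is an assumption taken from the sample data (8 + tab + 4 + tab + 3 + `\n`; a file with `\r\n` line endings would need 19), the header line is assumed to be absent or already skipped, and `readRecord` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>

// Assumed record length: "YYYYMMDD\tHHMM\tSSS\n" = 8 + 1 + 4 + 1 + 3 + 1 bytes.
const std::size_t kRecordLen = 18;

// Hypothetical helper: jump straight to record `index` (0-based) and return
// its text without the trailing newline. Returns "" if the read fails.
std::string readRecord(std::istream& in, std::size_t index)
{
    assert(kRecordLen > 1);
    in.clear();  // reset eof/fail bits left by an earlier read
    in.seekg(static_cast<std::streamoff>(index * kRecordLen), std::ios::beg);

    std::string record(kRecordLen, '\0');
    if (!in.read(&record[0], static_cast<std::streamsize>(kRecordLen)))
        return std::string();

    record.resize(kRecordLen - 1);  // drop the '\n'
    return record;
}
```

With this layout, record 250000 starts at byte 250000 * 18 = 4500000, which matches the figure above.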

        Arris7 wrote:

        I currently use CFile and I copy the whole file into a buffer and I close the file.

        I would consider using CStdioFile with CMemFile. That way your file is processed in memory, rather than on disk, and you can utilize line-parsing functions.


        "Approved Workmen Are Not Ashamed" - 2 Timothy 2:15

        "Judge not by the eye but by the heart." - Native American Proverb

          Arris74
          wrote on last edited by
          #5

          Thanks. CFile::Seek() doesn't help because I do not actually know which line to start and finish at. I only know a starting and ending date and time, so I need to read each string and compare it with a variable. Do you have a code sample showing how to use CStdioFile and CMemFile together? Which functions are used for parsing? I only saw CStdioFile::ReadString(), which cannot be used with CMemFile.

            led mike
            wrote on last edited by
            #6

            Arris7 wrote:

            actually I do not know at which line to start and finish. I just know a starting and ending date and time. So I need to read the String and to compare it with a variable.

            That contradicts your first post:

            Arris7 wrote:

            For example a just need to start at line 250000 and stop at line 3000000

            So which is it?

            led mike

              David Crow
              wrote on last edited by
              #7

              Arris7 wrote:

              CFile::Seek() doesn't help because actually I do not know at which line to start and finish. I just know a starting and ending date and time. So I need to read the String and to compare it with a variable.

              Fair enough, but you can still do it via a simple calculation, rather than using strtok() to find each line. strtok() is slowing you down as it has to examine each character to find the one you want. Since each line of the file is 18-19 characters in length, just compare the first 13 of those. Something like:

              #include <stdio.h>
              #include <string.h>

              int main( void )
              {
                  const char *szBuffer = "20000103\t1658\t351\n"
                                         "20000103\t1659\t352\n"
                                         "20000103\t1700\t350\n"
                                         "20000103\t1701\t352\n"
                                         "20000103\t1702\t355\n"
                                         "20000104\t0900\t354\n"
                                         "20000104\t0901\t352\n"
                                         "20000104\t0902\t350\n";
                  const char *p = szBuffer;

                  while (*p != '\0')
                  {
                      // compare only the 13-character date/time key
                      if (strncmp(p, "20000104\t0900", 13) == 0)
                      {
                          printf("Found it!\n");
                          break;
                      }

                      // advance to the next 'line' (18 bytes, '\n' included)
                      p += 18;
                  }

                  return 0;
              }
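Once the start and end records are known, the same fixed layout lets the three columns be copied into vectors without strtok(). The sketch below is one possible way to do it, assuming every record is exactly 18 bytes with the fields at offsets 0, 9 and 14 as in the sample data. `Columns`, `parseDigits` and `copyRange` are hypothetical names, and the digits are not validated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: "YYYYMMDD\tHHMM\tSSS\n" -> 18 bytes, fields at 0, 9, 14.
const std::size_t kRecordLen = 18;

struct Columns {
    std::vector<int> date, time, signal;
};

// Convert `n` ASCII digits starting at `p` to an int (no validation).
static int parseDigits(const char* p, std::size_t n)
{
    int value = 0;
    for (std::size_t i = 0; i < n; ++i)
        value = value * 10 + (p[i] - '0');
    return value;
}

// Copy records [first, last) from `buf` into the three column vectors.
Columns copyRange(const char* buf, std::size_t first, std::size_t last)
{
    assert(first <= last);
    Columns out;
    out.date.reserve(last - first);
    out.time.reserve(last - first);
    out.signal.reserve(last - first);

    for (std::size_t i = first; i < last; ++i) {
        const char* rec = buf + i * kRecordLen;  // start of record i
        out.date.push_back(parseDigits(rec, 8));
        out.time.push_back(parseDigits(rec + 9, 4));
        out.signal.push_back(parseDigits(rec + 14, 3));
    }
    return out;
}
```

Reserving the vectors up front avoids repeated reallocation, which can matter when the range covers millions of records.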


              "Approved Workmen Are Not Ashamed" - 2 Timothy 2:15

              "Judge not by the eye but by the heart." - Native American Proverb

                Arris74
                wrote on last edited by
                #8

                Sorry, I was not clear in my first post. Actually, I only know the date and time at which to start. For example, the starting date and time might be located at line 250000, and I have to find that line.

                  Arris74
                  wrote on last edited by
                  #9

                  Many thanks, that sounds great. I am going to try it and will let you know the results.

                    PJ Arends
                    wrote on last edited by
                    #10

                    For very large files (millions of lines) this will still take a while. Seeing how the file is sorted by date and time I would use your original idea of using CFile::Seek to do a binary search of the file. A binary search will be slower if the required data is right at the start of the file, but a heck of a lot faster if the data is anywhere else.
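A rough sketch of such a binary search, using `std::istream::seekg` as a stand-in for CFile::Seek so the example is self-contained. It assumes 18-byte records with no header line, a file sorted by date and time, and a 13-character key of the form "YYYYMMDD\tHHMM". `lowerBound` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file

const std::size_t kRecordLen = 18;  // assumed bytes per record, '\n' included
const std::size_t kKeyLen = 13;     // "YYYYMMDD\tHHMM"

// Return the index of the first record whose date/time key is >= `key`,
// or `recordCount` if every record is earlier. Requires sorted records.
std::size_t lowerBound(std::istream& in, std::size_t recordCount, const char* key)
{
    assert(std::strlen(key) == kKeyLen);
    std::size_t lo = 0, hi = recordCount;
    char buf[kKeyLen];

    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        in.clear();  // reset any state left by a previous read
        in.seekg(static_cast<std::streamoff>(mid * kRecordLen), std::ios::beg);
        if (!in.read(buf, static_cast<std::streamsize>(kKeyLen)))
            return recordCount;  // read error: report "not found"

        // Fixed-width digits make byte order equal to chronological order.
        if (std::memcmp(buf, key, kKeyLen) < 0)
            lo = mid + 1;  // record is too early: search the upper half
        else
            hi = mid;      // record is at or after the key: search the lower half
    }
    return lo;
}
```

Each pass halves the range, so 4 million records need roughly 22 seeks and reads instead of a scan over every line. Calling it once with the start key and once with the end key gives the two line numbers.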


                    You may be right
                    I may be crazy
                    -- Billy Joel --

                    Within you lies the power for good, use it!!!

                    Within you lies the power for good; Use it!

                      David Crow
                      wrote on last edited by
                      #11

                      PJ Arends wrote:

                      Seeing how the file is sorted by date and time...

                      Was that a guarantee? If so, then a binary search via Seek() is the way to go.


                      "Approved Workmen Are Not Ashamed" - 2 Timothy 2:15

                      "Judge not by the eye but by the heart." - Native American Proverb

                        led mike
                        wrote on last edited by
                        #12

                        If it is not sorted, a sorted solution would likely be optimal for 4 million records: either sorting the original file, or creating an index file or a memory-based index. Perhaps a database should be considered.
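One possible shape for a memory-based index, sketched under the assumption that the whole file is already in a buffer of 18-byte records. Each entry pairs a packed YYYYMMDDHHMM key with the record's position, and the entries are sorted once so that later lookups can use a binary search. `Index`, `makeKey`, `buildIndex` and `findFirstAtOrAfter` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

const std::size_t kRecordLen = 18;  // assumed bytes per record, '\n' included

// (key, record index) pairs; key = YYYYMMDDHHMM packed into one integer.
typedef std::vector<std::pair<std::int64_t, std::size_t> > Index;

// Pack the date and time digits of one record into a sortable integer.
static std::int64_t makeKey(const char* rec)
{
    std::int64_t key = 0;
    for (std::size_t i = 0; i < 13; ++i)
        if (i != 8)  // skip the tab between date and time
            key = key * 10 + (rec[i] - '0');
    return key;
}

// Build a sorted index over `count` records held in `buf`.
Index buildIndex(const char* buf, std::size_t count)
{
    assert(buf != nullptr);
    Index index;
    index.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        index.push_back(std::make_pair(makeKey(buf + i * kRecordLen), i));
    std::sort(index.begin(), index.end());
    return index;
}

// Return the record index of the earliest entry with key >= `key`,
// or static_cast<std::size_t>(-1) if there is none.
std::size_t findFirstAtOrAfter(const Index& index, std::int64_t key)
{
    Index::const_iterator it = std::lower_bound(
        index.begin(), index.end(), std::make_pair(key, std::size_t(0)));
    return it == index.end() ? static_cast<std::size_t>(-1) : it->second;
}
```

Building the index costs one pass plus a sort, so it only pays off when the same data is queried more than once.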

                        led mike

                          Arris74
                          wrote on last edited by
                          #13

                          Many thanks for your solutions. I think your idea of using strncmp(p, "20000104\t0900", 13) == 0 in a binary search is a good answer. I'll try it right now. Thanks again.

                            Arris74
                            wrote on last edited by
                            #14

                            Thanks for this solution. Yes, the file is sorted by date and time, so a binary search is a good answer to my question. Thanks again.

                              malaugh
                              wrote on last edited by
                              #15

                              Are the dates in order? If the file starts with the earliest date and finishes with the latest date, then use the following method:

                              1) fseek to the middle of the file.
                              2) If the date read there is later than the one you want, fseek to one quarter of the way through the file; if it is earlier, fseek to three quarters of the way through the file.
                              3) Repeat, halving the remaining range each time.

                              It is like guessing a number. If you ask someone to select a number between 0 and 15, the quickest way to find it is to ask "Is it less than 8?" If yes, ask "Is it less than 4?" If no, ask "Is it less than 6?" I'm sure you get the idea. It's a common technique.

                                Arris74
                                wrote on last edited by
                                #16

                                Thanks. Yes, the dates are in order. The method you suggest is a binary search. I think it is the best solution for my problem.
