Code Project

How to speed up copy of .txt files into arrays?

C / C++ / MFC
  Arris74 wrote:

    Hello,

    I use a text file of about 4 million lines with the following structure. The separator is '\t' (a tab).

    "Date"      "Time"  "Signal"
    20000103    1658    351
    20000103    1659    352
    20000103    1700    350
    20000103    1701    352
    20000103    1702    355
    20000104    0900    354
    20000104    0901    352
    20000104    0902    350

    I would like to copy the Date, Time and Signal columns into STL vector containers, but I do not want to copy the whole file. For example, I might only need the lines from 250000 to 3000000. The lines at which to start and stop copying are determined by the Date and Time columns.

    I currently use CFile to copy the whole file into a buffer, and then I close the file. This is very fast: less than 1 second. Then I use strtok to read the buffer string by string, counting lines until I find the date and time that I want. Unfortunately, this method takes 57 seconds to go through the whole file just to find the start and end line numbers.

    So I am wondering whether it is possible to read the buffer or the file column by column. Does anyone know of a method to read a string buffer or a file quickly? Thank you very much for any advice that helps speed up this process.

    For information, I tested CString's Tokenize function, but it is very slow. I also tested CStdioFile with its ReadString function, but that is also very slow.
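
    For reference, here is a minimal sketch of the parsing step described above, using std::strtol instead of strtok. It assumes the header line has already been skipped and that the buffer is NUL-terminated; the Columns struct and the parseRecords name are hypothetical and not from the post.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical container for the three columns; the names are illustrative.
struct Columns {
    std::vector<long> date, time, signal;
};

// Parse "date\ttime\tsignal\n" records from a NUL-terminated buffer.
// strtol skips leading whitespace (tabs and line endings included) and
// reports where it stopped, so no separate tokenizing pass is needed.
// An incomplete trailing record is dropped.
Columns parseRecords(const char* text) {
    assert(text != nullptr);
    Columns out;
    const char* p = text;
    for (;;) {
        long fields[3];
        int parsed = 0;
        for (; parsed < 3; ++parsed) {
            char* end = nullptr;
            fields[parsed] = std::strtol(p, &end, 10);  // base 10: "0900" -> 900
            if (end == p) break;                        // no number found
            p = end;
        }
        if (parsed < 3) break;
        out.date.push_back(fields[0]);
        out.time.push_back(fields[1]);
        out.signal.push_back(fields[2]);
    }
    return out;
}
```

    Passing the address of the first wanted record instead of the start of the buffer, and stopping once the end date is reached, would restrict the copy to the requested range.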

    David Crow wrote (#4):

    Arris74 wrote:

    But I do not want to copy all the file. For example I just need to start at line 250000 and stop at line 3000000.

    Use the CFile::Seek() method to go to offset 250000*length_of_line. With 18-byte lines, that is 4500000 (untested).

    Arris74 wrote:

    I currently use CFile and I copy the whole file into a buffer and I close the file.

    I would consider using CStdioFile with CMemFile. That way your file is processed in memory, rather than on disk, and you can utilize line-parsing functions.
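
    A minimal sketch of the offset calculation above, assuming every line has the same length and the file is opened in binary mode so that offsets are byte counts. It uses std::fseek from the C standard library rather than CFile::Seek() so that it runs without MFC; recordOffset and readRecord are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Byte offset of a zero-based line in a file whose lines all have the
// same length (terminator included).
long recordOffset(long lineIndex, long lineLength) {
    assert(lineIndex >= 0 && lineLength > 0);
    return lineIndex * lineLength;
}

// Seek directly to one line and return it without its line terminator.
// Returns an empty string if the line lies past the end of the file.
std::string readRecord(std::FILE* file, long lineIndex, long lineLength) {
    std::string line(static_cast<std::size_t>(lineLength), '\0');
    if (std::fseek(file, recordOffset(lineIndex, lineLength), SEEK_SET) != 0)
        return std::string();
    std::size_t got = std::fread(&line[0], 1, line.size(), file);
    line.resize(got);
    while (!line.empty() && (line.back() == '\n' || line.back() == '\r'))
        line.pop_back();
    return line;
}
```

    Note that a header line such as "Date" "Time" "Signal" has a different length from the data lines, so its size would have to be added to the offset.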


    "Approved Workmen Are Not Ashamed" - 2 Timothy 2:15

    "Judge not by the eye but by the heart." - Native American Proverb

      Arris74 wrote (#5):

      Thanks. CFile::Seek() doesn't help because I do not actually know at which line to start and finish; I only know a starting and ending date and time. So I need to read each string and compare it with a variable. Do you have a code sample showing how to use CStdioFile together with CMemFile? Which functions are used for parsing? I only saw CStdioFile::ReadString(), which cannot be used with CMemFile.

        led mike wrote (#6):

        Arris74 wrote:

        actually I do not know at which line to start and finish. I just know a starting and ending date and time. So I need to read the String and to compare it with a variable.

        That contradicts your first post:

        Arris74 wrote:

        For example I just need to start at line 250000 and stop at line 3000000

        So which is it?

        led mike

          David Crow wrote (#7):

          Arris74 wrote:

          CFile::Seek() doesn't help because actually I do not know at which line to start and finish. I just know a starting and ending date and time. So I need to read the String and to compare it with a variable.

          Fair enough, but you can still do it via a simple calculation, rather than using strtok() to find each line. strtok() is slowing you down as it has to examine each character to find the one you want. Since each line of the file is 18-19 characters in length (17 data characters plus a one- or two-byte line ending), just compare the first 13 of those. Something like:

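          #include <stdio.h>
          #include <string.h>

          int main( void )
          {
              /* Sample data: every record is exactly 18 bytes, '\n' included. */
              const char *szBuffer = "20000103\t1658\t351\n"
                                     "20000103\t1659\t352\n"
                                     "20000103\t1700\t350\n"
                                     "20000103\t1701\t352\n"
                                     "20000103\t1702\t355\n"
                                     "20000104\t0900\t354\n"
                                     "20000104\t0901\t352\n"
                                     "20000104\t0902\t350\n";
              const char *p = szBuffer;

              while (*p != '\0')
              {
                  /* Compare only the 13-byte key: 8 date digits, a tab, 4 time digits. */
                  if (strncmp(p, "20000104\t0900", 13) == 0)
                  {
                      printf("Found it!\n");
                      break;
                  }

                  /* Advance to the next 'line' (use 19 if lines end in "\r\n"). */
                  p += 18;
              }

              return 0;
          }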


          "Approved Workmen Are Not Ashamed" - 2 Timothy 2:15

          "Judge not by the eye but by the heart." - Native American Proverb

            Arris74 wrote (#8):

            Sorry, I was not clear in my first post. I only know the date and time at which to start. For example, the starting date and time might be located at line 250000, and I have to find that line.

              Arris74 wrote (#9):

              Many thanks, that sounds great. I am going to try it and will let you know the results.

                PJ Arends wrote (#10):

                For very large files (millions of lines) this will still take a while. Seeing how the file is sorted by date and time I would use your original idea of using CFile::Seek to do a binary search of the file. A binary search will be slower if the required data is right at the start of the file, but a heck of a lot faster if the data is anywhere else.
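
                A minimal sketch of such a binary search, run over the in-memory buffer the original poster already loads rather than through CFile::Seek. It assumes fixed-length records sorted by their first 13 bytes (date, tab, time); lowerBoundRecord is a hypothetical name. Comparing the keys as text works because they are fixed-width and zero-padded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Index of the first record whose 13-byte "date\ttime" key is >= key,
// or recordCount if every record is smaller. key must hold at least
// 13 characters.
std::size_t lowerBoundRecord(const char* buffer, std::size_t recordCount,
                             std::size_t recordLen, const char* key) {
    const std::size_t keyLen = 13;      // 8 date digits + tab + 4 time digits
    assert(recordLen >= keyLen);
    std::size_t lo = 0, hi = recordCount;
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (std::strncmp(buffer + mid * recordLen, key, keyLen) < 0)
            lo = mid + 1;               // key lies after record mid
        else
            hi = mid;                   // key is at mid or before it
    }
    return lo;
}
```

                For 4 million records this needs about 22 comparisons per lookup. Calling it once with the start key and once with the end key gives the two line numbers that bound the range to copy.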


                You may be right
                I may be crazy
                -- Billy Joel --

                Within you lies the power for good, use it!!!

                Within you lies the power for good; Use it!

                  David Crow wrote (#11):

                  PJ Arends wrote:

                  Seeing how the file is sorted by date and time...

                  Was that a guarantee? If so, then a binary search via Seek() is the way to go.
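
                  If the ordering is not guaranteed, it can be checked once before relying on a binary search. A minimal sketch, assuming the file is already in memory as fixed-length records with a 13-byte key; recordsAreSorted is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True if each record's 13-byte "date\ttime" key is >= the key of the
// record before it. This is a single linear pass over the buffer.
bool recordsAreSorted(const char* buffer, std::size_t recordCount,
                      std::size_t recordLen) {
    const std::size_t keyLen = 13;
    assert(recordLen >= keyLen);
    for (std::size_t i = 1; i < recordCount; ++i) {
        const char* prev = buffer + (i - 1) * recordLen;
        const char* curr = buffer + i * recordLen;
        if (std::strncmp(prev, curr, keyLen) > 0)
            return false;               // found a record out of order
    }
    return true;
}
```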


                  "Approved Workmen Are Not Ashamed" - 2 Timothy 2:15

                  "Judge not by the eye but by the heart." - Native American Proverb

                    led mike wrote (#12):

                    If it is not sorted, a sorted solution would likely be optimal for 4 million records: either sort the original file, or create an index file or a memory-based index. Perhaps a database should be considered.
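
                    One way to read the memory-based index idea is sketched below: record where each line starts, sort those offsets by key, then binary search the index instead of the file. This is an illustration under assumptions (a NUL-terminated buffer, lines ending in '\n', keys at the start of each line), and buildSortedIndex and findFirstAtOrAfter are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Collect the start offset of every line, then sort the offsets by the
// first keyLen bytes of the line they point to. Lines may differ in length.
std::vector<std::size_t> buildSortedIndex(const char* buffer, std::size_t size,
                                          std::size_t keyLen) {
    assert(buffer != nullptr);
    std::vector<std::size_t> starts;
    std::size_t pos = 0;
    while (pos < size) {
        starts.push_back(pos);
        const void* nl = std::memchr(buffer + pos, '\n', size - pos);
        if (nl == nullptr) break;       // last line has no terminator
        pos = static_cast<std::size_t>(static_cast<const char*>(nl) - buffer) + 1;
    }
    // stable_sort keeps file order among lines with equal keys.
    std::stable_sort(starts.begin(), starts.end(),
        [=](std::size_t a, std::size_t b) {
            return std::strncmp(buffer + a, buffer + b, keyLen) < 0;
        });
    return starts;
}

// Position in the index of the first line whose key is >= key.
std::size_t findFirstAtOrAfter(const char* buffer,
                               const std::vector<std::size_t>& index,
                               const char* key, std::size_t keyLen) {
    auto it = std::lower_bound(index.begin(), index.end(), key,
        [=](std::size_t start, const char* k) {
            return std::strncmp(buffer + start, k, keyLen) < 0;
        });
    return static_cast<std::size_t>(it - index.begin());
}
```

                    Building the index costs one pass plus a sort, so it pays off mainly when the same file is searched more than once.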

                    led mike

                      Arris74 wrote (#13):

                      Many thanks for your solutions. I think your idea of using strncmp(p, "20000104\t0900", 13) == 0 in a binary search is a good answer. I'll try it right now. Thanks again.

                        Arris74 wrote (#14):

                        Thanks for this solution. Yes, the file is sorted by date and time, so a binary search is a good answer to my question. Thanks again.

                          malaugh wrote (#15):

                          Are the dates in order? If the file starts with the earliest date and finishes with the latest date, then use the following method:

                          1) fseek to the middle of the file.
                          2) If the data there is larger than what you are looking for, fseek to one quarter of the way through the file; if smaller, fseek to three quarters of the way through the file.
                          3) Repeat.

                          It is like guessing a number. If you ask someone to select a number between 0 and 15, the quickest way to find the number is to ask, "Is it less than 8?" If yes, then ask, "Is it less than 4?" If no, then ask, "Is it less than 6?" I'm sure you get the idea. It's a common technique.
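
                          The steps above can be sketched with std::fseek and std::fread as follows. This assumes fixed-length records sorted by a 13-byte key and a file opened in binary mode; findRecordInFile is a hypothetical name, and the record count would normally be the file size divided by the record length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Binary search a file of fixed-length, key-sorted records. Returns the
// zero-based index of the first record whose leading 13 bytes are >= key,
// recordCount if there is none, or -1 on a seek or read error.
// key must hold at least 13 characters.
long findRecordInFile(std::FILE* file, long recordCount, long recordLen,
                      const char* key) {
    const std::size_t keyLen = 13;
    assert(recordLen >= static_cast<long>(keyLen));
    char probe[keyLen];
    long lo = 0, hi = recordCount;
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        if (std::fseek(file, mid * recordLen, SEEK_SET) != 0) return -1;
        if (std::fread(probe, 1, keyLen, file) != keyLen) return -1;
        if (std::memcmp(probe, key, keyLen) < 0)
            lo = mid + 1;   // target lies in the later half
        else
            hi = mid;       // target is here or in the earlier half
    }
    return lo;
}
```

                          Only 13 bytes are read per probe, so the whole file never has to be loaded just to locate the start and end lines.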

                            Arris74 wrote (#16):

                            Thanks. Yes, the dates are in order. The method you suggest is a binary search, and I think it is the best solution for my problem.
