How to speed up copy of .txt files into arrays?
-
Hello,

I use a text file of about 4 million lines with the following structure. The separator is a tab ('\t').

"Date"      "Time"  "Signal"
20000103    1658    351
20000103    1659    352
20000103    1700    350
20000103    1701    352
20000103    1702    355
20000104    0900    354
20000104    0901    352
20000104    0902    350

I would like to copy the columns Date, Time and Signal into STL vector containers, but I do not want to copy the whole file. For example, I just need to start at line 250000 and stop at line 3000000. The criterion that defines where to start and stop copying is the Date and Time columns.

I currently use CFile: I copy the whole file into a buffer and close the file. This is very fast, less than 1 second. Then I use strtok to read the buffer string by string, counting lines until I find the date and time that I want. Unfortunately, this method takes 57 seconds to go through the whole file just to pick up the start and end line numbers.

So I am wondering whether it is possible to read the buffer or the file column by column. Does anyone know of a method to quickly read a string buffer or a file? Thank you very much for any advice that helps speed up this process.

For information, I tested the Tokenize function of CString, but it is very slow. I also tested CStdioFile and its ReadString function, but it is also very slow.
Arris7 wrote:
But I do not want to copy all the file. For example I just need to start at line 250000 and stop at line 3000000.
Use the CFile::Seek() method to go to offset 250000 * length_of_line. Without testing, this is 4500000 (250000 lines at 18 bytes per line).
Arris7 wrote:
I currently use CFile and I copy the whole file into a buffer and I close the file.
I would consider using CStdioFile with CMemFile. That way your file is processed in memory, rather than on disk, and you can utilize line-parsing functions.
"Approved Workmen Are Not Ashamed" - 2 Timothy 2:15
"Judge not by the eye but by the heart." - Native American Proverb
-
Thanks. CFile::Seek() doesn't help because I actually do not know at which line to start and finish; I only know a starting and ending date and time. So I need to read each string and compare it with a variable. Do you have a code sample showing how to use CStdioFile together with CMemFile? Which functions are used for parsing? I only saw CStdioFile::ReadString(), which cannot be used with CMemFile.
-
Arris7 wrote:
actually I do not know at which line to start and finish. I just know a starting and ending date and time. So I need to read the String and to compare it with a variable.
That contradicts your first post:
Arris7 wrote:
For example I just need to start at line 250000 and stop at line 3000000
So which is it?
led mike
-
Arris7 wrote:
CFile::Seek() doesn't help because actually I do not know at which line to start and finish. I just know a starting and ending date and time. So I need to read the String and to compare it with a variable.
Fair enough, but you can still do it via a simple calculation, rather than using strtok() to find each line. strtok() is slowing you down as it has to examine each character to find the one you want. Since each line of the file is 18-19 characters in length, just compare the first 13 of those. Something like:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *szBuffer = "20000103\t1658\t351\n"
                           "20000103\t1659\t352\n"
                           "20000103\t1700\t350\n"
                           "20000103\t1701\t352\n"
                           "20000103\t1702\t355\n"
                           "20000104\t0900\t354\n"
                           "20000104\t0901\t352\n"
                           "20000104\t0902\t350\n";
    const char *p = szBuffer;

    while (p != NULL && *p != '\0')
    {
        /* the first 13 characters of a line are "YYYYMMDD\tHHMM" */
        if (strncmp(p, "20000104\t0900", 13) == 0)
        {
            printf("Found it!\n");
            break;
        }
        /* advance to the next 'line' (18 characters in this sample) */
        p += 18;
    }
    return 0;
}
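
To connect the 13-character comparison back to the original goal (filling STL vectors for a date/time range), here is a minimal sketch. It makes several assumptions: the whole file is already in a std::string, every data line starts with a fixed-width "YYYYMMDD\tHHMM" key, and lines end in '\n'. The names Columns, kKeyLen and extractRange are made up for illustration. It finds line ends with memchr instead of a fixed "p += 18", so it tolerates the 18-19 character variation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical container for the three columns.
struct Columns {
    std::vector<long> date;
    std::vector<int>  time;
    std::vector<int>  signal;
};

const std::size_t kKeyLen = 13;  // length of "YYYYMMDD\tHHMM"

// Copies every row whose "date\ttime" key lies in [startKey, endKey].
// Assumption: keys are fixed-width digits, so byte order equals time order.
Columns extractRange(const std::string& buffer,
                     const std::string& startKey,
                     const std::string& endKey)
{
    assert(startKey.size() == kKeyLen && endKey.size() == kKeyLen);
    Columns out;
    const char* p   = buffer.data();
    const char* end = p + buffer.size();
    while (p < end) {
        // memchr locates the end of the line without tokenizing it.
        const char* nl = static_cast<const char*>(
            std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
        const char* lineEnd = nl ? nl : end;
        // The header line starts with '"', which sorts below any digit,
        // so it fails the startKey comparison and is skipped.
        if (static_cast<std::size_t>(lineEnd - p) > kKeyLen + 1 &&
            std::memcmp(p, startKey.data(), kKeyLen) >= 0 &&
            std::memcmp(p, endKey.data(), kKeyLen) <= 0) {
            // Layout: 8 date digits, tab, 4 time digits, tab, signal.
            out.date.push_back(std::strtol(p, nullptr, 10));
            out.time.push_back(static_cast<int>(std::strtol(p + 9, nullptr, 10)));
            out.signal.push_back(static_cast<int>(std::strtol(p + 14, nullptr, 10)));
        }
        p = nl ? nl + 1 : end;
    }
    return out;
}
```

A call such as extractRange(buffer, "20000103\t1659", "20000104\t0900") would collect the rows between those two timestamps, inclusive. This is still one linear pass over the buffer; it only avoids the per-token work of strtok.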
"Approved Workmen Are Not Ashamed" - 2 Timothy 2:15
"Judge not by the eye but by the heart." - Native American Proverb
-
For very large files (millions of lines) this will still take a while. Seeing how the file is sorted by date and time, I would use your original idea of using CFile::Seek to do a binary search of the file. A binary search will be slower if the required data is right at the start of the file, but a heck of a lot faster if the data is anywhere else.
You may be right
I may be crazy
-- Billy Joel --Within you lies the power for good, use it!!!
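
A rough sketch of the seek-based binary search, for illustration only. It uses the C stdio calls (fseek, ftell, fgets) as a portable stand-in for CFile::Seek, and it assumes that the file is sorted ascending by its "YYYYMMDD\tHHMM" key, that it was opened in binary mode, and that no line is longer than 63 bytes. The names lineAtOrAfter and findFirstAtOrAfter are invented here. Because a seek can land in the middle of a line, each probe first skips forward to the next line start.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

const std::size_t kKeyLen = 13;  // length of "YYYYMMDD\tHHMM"

// Reads the first complete line that starts at or after byte `pos`.
// Returns that line's offset, or -1 if there is no such line.
static long lineAtOrAfter(std::FILE* f, long pos, char* line, int cap)
{
    if (pos > 0) {
        // Start one byte early so a `pos` already on a line start is kept.
        std::fseek(f, pos - 1, SEEK_SET);
        int c;
        while ((c = std::fgetc(f)) != EOF && c != '\n') {}
        if (c == EOF) return -1;
    } else {
        std::fseek(f, 0, SEEK_SET);
    }
    long start = std::ftell(f);
    if (!std::fgets(line, cap, f)) return -1;
    return start;
}

// Returns the offset of the first line whose key is >= targetKey, or -1.
// A quoted header line sorts below any digit key, so it needs no special case.
long findFirstAtOrAfter(std::FILE* f, const char* targetKey)
{
    assert(std::strlen(targetKey) == kKeyLen);
    char line[64];
    std::fseek(f, 0, SEEK_END);
    long lo = 0;
    long hi = std::ftell(f);
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        long s = lineAtOrAfter(f, mid, line, static_cast<int>(sizeof line));
        if (s < 0 || std::strncmp(line, targetKey, kKeyLen) >= 0)
            hi = mid;        // the wanted line starts at or after mid's line
        else
            lo = s + 1;      // the line at s is too early; look past it
    }
    return lineAtOrAfter(f, lo, line, static_cast<int>(sizeof line));
}
```

Calling it once with the start key and once with the end key gives the two byte offsets, and only the bytes in between then need to be read and parsed.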
-
PJ Arends wrote:
Seeing how the file is sorted by date and time...
Was that a guarantee? If so, then a binary search via Seek() is the way to go.
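
If the ordering is not guaranteed, it can be checked once before relying on a binary search. The following is only a sketch: it assumes the file is already in a std::string, that data lines begin with a digit, and that the key is the first 13 characters ("YYYYMMDD\tHHMM"). The function name keysAreSorted is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Returns true if the 13-character keys of the data lines never decrease.
// Lines that do not start with a digit (such as a quoted header) are ignored.
bool keysAreSorted(const std::string& buffer)
{
    const std::size_t keyLen = 13;
    const char* prev = nullptr;              // key of the previous data line
    const char* p    = buffer.data();
    const char* end  = p + buffer.size();
    while (p < end) {
        const char* nl = static_cast<const char*>(
            std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
        const char* lineEnd = nl ? nl : end;
        if (static_cast<std::size_t>(lineEnd - p) >= keyLen &&
            p[0] >= '0' && p[0] <= '9') {
            // Fixed-width digit keys compare correctly byte by byte.
            if (prev && std::memcmp(prev, p, keyLen) > 0) return false;
            prev = p;
        }
        p = nl ? nl + 1 : end;
    }
    return true;
}
```

This costs one full pass over the data, so it is a one-time sanity check rather than something to run before every lookup.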
"Approved Workmen Are Not Ashamed" - 2 Timothy 2:15
"Judge not by the eye but by the heart." - Native American Proverb
-
Sorry, I was not clear in my first post. Actually, I only know the date and time at which to start. For example, the starting date and time could be located at line 250000, and I have to find that line.
Are the dates in order? If the file starts with the earliest date and finishes with the latest date, then use the following method:
1) fseek to the middle of the file.
2) If the date found there is later than the one you want, fseek to one quarter of the way through the file; if it is earlier, fseek to three quarters of the way through the file.
3) Repeat, halving the remaining range each time.
It is like guessing a number: if you ask someone to select a number between 0 and 15, the quickest way to find it is to ask "Is it less than 8?" If yes, then ask "Is it less than 4?" If no, then ask "Is it less than 6?" I'm sure you get the idea. It's a common technique.
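
To put a number on the guessing game, the sketch below counts how many halving questions are needed for a given number of candidates. The function name halvingSteps is made up for illustration. It counts line-level halvings; a search over byte offsets in a real file would take a few more probes, since there are more bytes than lines.

```cpp
#include <cassert>
#include <cstdint>

// Number of yes/no questions needed to single out one of n candidates
// when each question halves the remaining range.
int halvingSteps(std::uint64_t n)
{
    assert(n >= 1);
    int steps = 0;
    std::uint64_t span = 1;      // how many candidates `steps` questions can separate
    while (span < n) {
        span *= 2;
        ++steps;
    }
    return steps;
}
```

halvingSteps(16) is 4, matching the 0 to 15 example, and halvingSteps(4000000) is 22, so a file of about 4 million lines needs roughly 22 comparisons instead of a scan of every line.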