_findfirst and fopen very slow

C / C++ / MFC

cristiapi (#1):

I have about 1800 text files in a directory, and their number is slowly growing (several new files a day). Each file is about 40 kB. I need to enumerate all files with the ".states" extension, open each one, and check whether it contains a line starting with a particular sequence of 4 characters; if such a line exists, I save it in a std::vector. I use the following code, but the first run takes a very long time (subsequent runs are very fast):

    struct _finddata_t fd;
    intptr_t hFile;    // _findfirst returns intptr_t; long truncates it on 64-bit
    if((hFile = _findfirst("*.states", &fd)) == -1L) return; // no matching file
    do {
    	FILE *f = fopen(fd.name, "rS"); // "S" = sequential-access hint (MSVC)
    	if(f == NULL) continue;         // skip files that fail to open
    	while(fgets(buf, sizeof(buf), f)) {
    		if(0 == _strnicmp(buf, "ABCD", 4)) {
    			lines.push_back(buf);   // save the line in a std::vector<std::string>
    			break;
    		}
    	}
    	fclose(f);
    } while(_findnext(hFile, &fd) == 0);
    _findclose(hFile);

Is there any way to speed up the code? If I merge all the files into a single file the problem goes away, but I would prefer to keep the original files.

David Crow (#2):

Are the files that were in that folder on Monday (for example) still there on Tuesday? In other words, are you processing every file in that folder, or just the new ones? Keep in mind that file I/O is arguably one of the slowest operations on a computer. It can be sped up to a marginal degree, but you are ultimately at the mercy of the disk.

    "One man's wage rise is another man's price increase." - Harold Wilson

    "Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons

    "You can easily judge the character of a man by how he treats those who can do nothing for him." - James D. Miles

    C 1 Reply Last reply
    0
    • D David Crow

      Are the files that were in that folder on Monday (for example) still there on Tuesday? In other words, are you processing every file in that folder, or just the new ones? Keep in mind that file I/O is arguably one of the slowest operations on a computer. It can be sped up to a marginal degree, but you are ultimately at the mercy of the disk.

      "One man's wage rise is another man's price increase." - Harold Wilson

      "Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons

      "You can easily judge the character of a man by how he treats those who can do nothing for him." - James D. Miles

      C Offline
      C Offline
      cristiapi
      wrote on last edited by
      #3

      David Crow wrote:

      Are the files that were in that folder on Monday still there on Tuesday?

      Yes.

      In other words, are you processing every file in that folder, or just the new ones?

      Every file in that folder.

David Crow (#4), in reply to cristiapi:

So do the files that were processed on a Monday need to be processed again the next day?

        "One man's wage rise is another man's price increase." - Harold Wilson

        "Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons

        "You can easily judge the character of a man by how he treats those who can do nothing for him." - James D. Miles

        C 1 Reply Last reply
        0
        • C cristiapi

          I have about 1800 text files in a dir and their number is slowly growing (several files in a day). The size of a file is about 40 kB. I need to enumerate all the file with the ".states" extension, open the file, check to see if there is a line that starts with a particular sequence of 4 chars and if the line exists, I save the line in a std::vector. I use the following code, but it takes very long time at the first run (the subsequent runs are very fast):

          struct \_finddata\_t fd; long hFile;
          if((hFile=\_findfirst("\*.states", &fd))== -1L) return; // File not found
          do {
          	FILE \*f= fopen(fd.name, "rS");
          	while(fgets(buf, sizeof(buf), f)) {
          		if(0 == \_strnicmp(buf, "ABCD", 4)) {
          			Save buf in a std::vector
          			break;
          		}
          	}
          	fclose(f);
          } while(\_findnext(hFile, &fd) == 0);
          \_findclose(hFile);
          

          Is there any way to speedup the code? If I merge all the files in a single file, I solve the problem, but I prefer to keep all the original files.

          J Offline
          J Offline
          jeron1
          wrote on last edited by
          #5

          If you comment out the 'Save buf in a std::vector' code, is it still unacceptably slow?

          "the debugger doesn't tell me anything because this code compiles just fine" - random QA comment "Facebook is where you tell lies to your friends. Twitter is where you tell the truth to strangers." - chriselst "I don't drink any more... then again, I don't drink any less." - Mike Mullikins uncle

          C 1 Reply Last reply
          0
          • J jeron1

            If you comment out the 'Save buf in a std::vector' code, is it still unacceptably slow?

            "the debugger doesn't tell me anything because this code compiles just fine" - random QA comment "Facebook is where you tell lies to your friends. Twitter is where you tell the truth to strangers." - chriselst "I don't drink any more... then again, I don't drink any less." - Mike Mullikins uncle

            C Offline
            C Offline
            cristiapi
            wrote on last edited by
            #6

            jeron1 wrote:

            If you comment out the 'Save buf in a std::vector' code, is it still unacceptably slow?

            Yes; that code takes a totally negligible amount of time. If I only do the search, without opening the file, it's very fast, but if I also open the file, the code is terribly slow.

cristiapi (#7), in reply to David Crow:

David Crow wrote:

So do the files that were processed on a Monday need to be processed again the next day?

Yes. All those files can be thought of as a database that I need to read every time I start the program.

David Crow (#8), in reply to cristiapi:

Have you considered doing the processing in a (background) worker thread?
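A minimal sketch of that idea, assuming the _findfirst/_findnext loop from the original post is wrapped in a function (ScanStates below is a hypothetical name, stubbed out here so the example compiles):

    #include <functional>
    #include <string>
    #include <thread>
    #include <vector>

    // Hypothetical stand-in for the _findfirst/_findnext loop from the
    // original post; a real version would fill 'out' with the "ABCD" lines.
    static void ScanStates(std::vector<std::string> &out)
    {
        out.push_back("ABCD ...");      // placeholder for the real scan
    }

    int main()
    {
        std::vector<std::string> lines;
        std::thread worker(ScanStates, std::ref(lines)); // scan in the background

        // ... bring up the UI here, so startup feels responsive ...

        worker.join();  // block only at the point where 'lines' is actually needed
        return 0;
    }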

                "One man's wage rise is another man's price increase." - Harold Wilson

                "Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons

                "You can easily judge the character of a man by how he treats those who can do nothing for him." - James D. Miles

                C 1 Reply Last reply
                0
                • C cristiapi

                  jeron1 wrote:

                  If you comment out the 'Save buf in a std::vector' code, is it still unacceptably slow?

                  Yes; that code takes a totally negligible amount of time. If I only do the search, without opening the file, it's very fast, but if I also open the file, the code is terribly slow.

                  D Offline
                  D Offline
                  David Crow
                  wrote on last edited by
                  #9

                  Member 3648633 wrote:

                  If I only do the search, without opening the file

                  How can you search the contents of a file without first opening it?

                  "One man's wage rise is another man's price increase." - Harold Wilson

                  "Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons

                  "You can easily judge the character of a man by how he treats those who can do nothing for him." - James D. Miles

                  C 1 Reply Last reply
                  0
                  • D David Crow

                    Have you considered doing the processing in a (background) worker thread?

                    "One man's wage rise is another man's price increase." - Harold Wilson

                    "Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons

                    "You can easily judge the character of a man by how he treats those who can do nothing for him." - James D. Miles

                    C Offline
                    C Offline
                    cristiapi
                    wrote on last edited by
                    #10

                    No, because when I start the program, I need to first process the files.

cristiapi (#11), in reply to David Crow:

By "search" I meant the file enumeration with _findnext().

David Crow (#12), in reply to cristiapi:

That fact does not negate the need for a worker thread. Part of the program's slowness may be a matter of perception: by having a responsive UI (not saying that yours is), the perception that the program is running slow is minimized.

                        "One man's wage rise is another man's price increase." - Harold Wilson

                        "Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons

                        "You can easily judge the character of a man by how he treats those who can do nothing for him." - James D. Miles

                        1 Reply Last reply
                        0
                        • C cristiapi

                          jeron1 wrote:

                          If you comment out the 'Save buf in a std::vector' code, is it still unacceptably slow?

                          Yes; that code takes a totally negligible amount of time. If I only do the search, without opening the file, it's very fast, but if I also open the file, the code is terribly slow.

                          J Offline
                          J Offline
                          jeron1
                          wrote on last edited by
                          #13

                          Maybe read the whole file at once as opposed to many fgets() calls, then do your string search in RAM?

                          "the debugger doesn't tell me anything because this code compiles just fine" - random QA comment "Facebook is where you tell lies to your friends. Twitter is where you tell the truth to strangers." - chriselst "I don't drink any more... then again, I don't drink any less." - Mike Mullikins uncle

                          C 1 Reply Last reply
                          0
                          • C cristiapi

                            I have about 1800 text files in a dir and their number is slowly growing (several files in a day). The size of a file is about 40 kB. I need to enumerate all the file with the ".states" extension, open the file, check to see if there is a line that starts with a particular sequence of 4 chars and if the line exists, I save the line in a std::vector. I use the following code, but it takes very long time at the first run (the subsequent runs are very fast):

                            struct \_finddata\_t fd; long hFile;
                            if((hFile=\_findfirst("\*.states", &fd))== -1L) return; // File not found
                            do {
                            	FILE \*f= fopen(fd.name, "rS");
                            	while(fgets(buf, sizeof(buf), f)) {
                            		if(0 == \_strnicmp(buf, "ABCD", 4)) {
                            			Save buf in a std::vector
                            			break;
                            		}
                            	}
                            	fclose(f);
                            } while(\_findnext(hFile, &fd) == 0);
                            \_findclose(hFile);
                            

                            Is there any way to speedup the code? If I merge all the files in a single file, I solve the problem, but I prefer to keep all the original files.

                            L Offline
                            L Offline
                            Lost User
                            wrote on last edited by
                            #14

                            Hi, Some comments: 1.) Reading files may be slightly faster if you read using a multiple of the drive sector size or read it all at once. 2.) If you've ever wondered why file-backed cache implementations save files into a hierarchical folder structure it's because enumerating 10,000 files in a single folder may cause a small performance hit. If you plan on storing many thousands of files you may want to design a folder structure. Maybe something simple such as alphabetical A-C, D-F ... or something based on timestamp. This is not much of an issue on a modern SSD but old spindle drives take a performance hit. 3.) The code you have shown above is reading the file contents into a local buffer. You would get a huge performance boost by using the [MapViewOfFile function](https://docs.microsoft.com/en-us/windows/desktop/api/memoryapi/nf-memoryapi-mapviewoffile) to map the file directly into your process space. Have a look at the [Creating a View Within a File](https://docs.microsoft.com/en-us/windows/desktop/Memory/creating-a-view-within-a-file) sample. This sample is demonstrating how to take a large file and map only 1kb at a time into your process. Don't do that. You stated that your files are around ~40 kb so I'd recommend mapping the entire file into your process address space. I'd also recommend using two file mappings. While FileA is being processed you can have the operating system map FileB into your process. This would mitigate any latency caused by the i/o subsystem. The majority of your latency is between opening files. I highly recommend the second file mapping. Best Wishes, -David Delaune
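A minimal sketch of mapping one whole file and scanning its line starts, assuming ANSI file names and plain-text content; error handling is trimmed to early returns:

    #include <windows.h>
    #include <string.h>

    // Scan one mapped file for a line starting with "ABCD" (case-insensitive).
    static void ScanMapped(const char *name)
    {
        HANDLE file = CreateFileA(name, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if(file == INVALID_HANDLE_VALUE) return;

        HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
        if(mapping) {
            const char *base = (const char *)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
            if(base) {
                DWORD size = GetFileSize(file, NULL);
                for(DWORD i = 0; i + 4 <= size; ) {          // i is always a line start
                    if(_strnicmp(base + i, "ABCD", 4) == 0) {
                        // copy the line out of the view and store it here
                        break;
                    }
                    while(i < size && base[i] != '\n') ++i;  // skip to the next line
                    ++i;
                }
                UnmapViewOfFile(base);
            }
            CloseHandle(mapping);
        }
        CloseHandle(file);
    }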

leon de boer (#15), in reply to cristiapi's original post:

The more obvious answer is to get whatever is saving the files to write the ones that contain your string ABCD under a special name. Then you don't have to search inside the files at all to find the ones you want. Another obvious choice is to keep the files on a RAM disk, since there isn't much data. The whole process seems a bit backward to me: you are working on the reading code rather than the writing code.
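If the writer knows, while saving, whether the tag occurred, a hypothetical finishing step could encode that in the name (FinishStatesFile and the ".tagged.states" suffix are illustrative, not from the thread):

    #include <stdio.h>

    // Hypothetical writer-side step: the writer already knows whether the
    // file it just produced contains the "ABCD" tag, so encode that in the name.
    void FinishStatesFile(const char *tmp_name, const char *base_name, int has_tag)
    {
        char final_name[260];
        snprintf(final_name, sizeof(final_name),
                 has_tag ? "%s.tagged.states" : "%s.states", base_name);
        rename(tmp_name, final_name);
        // The reader can then enumerate "*.tagged.states" and skip the rest.
    }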

Stefan_Lang (#16), in reply to cristiapi's original post:

I suspect the reason your program runs much faster on the second run is that modern drives cache a certain amount of data, and therefore don't need to hit the slow hardware when repeatedly reading the same files. In your code you read each file line by line. Internally, each of these reads triggers a request for a block (or several blocks) of data; while each block is probably cached for consecutive reads, any read that needs a new block causes another slow access to the hard disk. You could speed this up by reading the whole file in a single operation: query its size, allocate a sufficiently large buffer, open the file as binary, and read it into that buffer in one call. Your inner while loop can then pull each line from that buffer, which should be considerably faster.
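A minimal sketch of that approach, assuming MSVC (the "S" open flag and _strnicmp are Microsoft-specific) and files small enough to buffer whole:

    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <vector>

    // Read one whole file into memory, then walk it line by line in RAM.
    static void ScanWholeFile(const char *name, std::vector<std::string> &matches)
    {
        FILE *f = fopen(name, "rbS");     // binary mode: no CR/LF translation
        if(f == NULL) return;
        fseek(f, 0, SEEK_END);
        long size = ftell(f);             // query the size...
        fseek(f, 0, SEEK_SET);
        if(size <= 0) { fclose(f); return; }
        std::string data((size_t)size, '\0');
        size_t got = fread(&data[0], 1, (size_t)size, f); // ...read it in one call
        fclose(f);
        data.resize(got);

        for(size_t pos = 0; pos < data.size(); ) {        // pos is a line start
            size_t eol = data.find('\n', pos);
            if(eol == std::string::npos) eol = data.size();
            if(eol - pos >= 4 && _strnicmp(data.c_str() + pos, "ABCD", 4) == 0) {
                matches.push_back(data.substr(pos, eol - pos));
                break;                    // the original code stops at the first match
            }
            pos = eol + 1;
        }
    }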

Stefan_Lang (#17), in reply to leon de boer:

That's a good approach, but as I understand it the main task is not to find the files that contain the string, but to find all lines within these files that contain it; a special kind of filename wouldn't be enough. Your suggestion to pull the writing side into the solution is a good one, though. If we do that, we might as well write all the data to a database; retrieving the correct lines would then only require a simple SQL query.

cristiapi (#18), in reply to Lost User:

I tried with one file mapping only; it works, but the speed is exactly the same. Probably the only way is to merge all the files into one big file; that way I can also optimize the file format for my needs. Thank you all.

leon de boer (#19), in reply to Stefan_Lang:

You only need one occurrence to flag the file as special; who cares how many times the sequence appears after that. The point is to eliminate the mass of files that aren't of any interest by using the name. I am not changing anything other than the name of the file, which is hardly complex or rocket science, and much easier and much faster than a database connection :-)

Stefan_Lang (#20), in reply to leon de boer:

I did not see the OP mention how many of the files contain that symbol. If a significant fraction of the files are affected, your solution would not help a lot.

leon de boer wrote:

much faster than a database connection

...to implement, sure. But certainly not to execute. ;)

Joe Woodbury (#21), in reply to cristiapi's original post:

The main culprit is fgets. Once you call it, the fopen series of calls immediately loads, I believe, 32 kB of data, and on top of that fgets itself is relatively slow. For speed, you may be better off using fread, reading 4 kB (or the page size) at a time and parsing each block yourself, simply looking for ABCD. This could be sped up further with a Boyer-Moore search, though since the string is short, scanning first for 'A' and then checking the rest may be faster. (That said, I believe some newer implementations of the standard library now include a Boyer-Moore searcher.) Do also note that caching plays a big part here: just recursing folders will take significantly longer on the first pass than on the second. This can be deceptive, however, since in actual operation those caches may be flushed between program runs.
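A rough sketch of that block-reading scan, assuming a case-sensitive tag for brevity; the last 3 bytes of each block are carried over so a tag straddling two reads is still found (note this detects the tag anywhere, not only at line starts):

    #include <stdio.h>
    #include <string.h>

    // Return 1 if the file contains "ABCD", reading 4 kB blocks with fread.
    static int FileContainsTag(const char *name)
    {
        char block[4096 + 3];
        size_t carry = 0;
        FILE *f = fopen(name, "rbS");
        if(f == NULL) return 0;
        size_t got;
        while((got = fread(block + carry, 1, sizeof(block) - carry, f)) > 0) {
            size_t len = carry + got;
            const char *p = block, *end = block + len;
            // scan for 'A' first, then check the remaining 3 bytes
            while((p = (const char *)memchr(p, 'A', (size_t)(end - p))) != NULL) {
                if(end - p >= 4 && memcmp(p, "ABCD", 4) == 0) {
                    fclose(f);
                    return 1;
                }
                ++p;
            }
            carry = (len >= 3) ? 3 : len;
            memmove(block, block + len - carry, carry); // keep a possible partial match
        }
        fclose(f);
        return 0;
    }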
