_findfirst and fopen very slow

Stefan_Lang wrote:

That's a good approach, but as I understand it the main task is not to find files that contain the string, but to find all lines within these files containing it. A specific kind of filename wouldn't be enough. Your suggestion to include the writing side in the solution is a good idea. However, if we do that, we might as well write all the data to a database. Retrieving the correct lines would then only require a simple SQL query.

leon de boer wrote (#19):

You only need one instance to flag it as special; who cares how many times it writes the special sequence after that. The process is to eliminate the mass of files that aren't of any interest by using the name. I am not changing anything other than the name of the file .. hardly complex or rocket science, and much easier and much faster than a database connection :-)

    In vino veritas


Stefan_Lang wrote (#20):

      I did not see the OP mention how many of the files do have that symbol. If a significant fraction of the files are affected, your solution would not help a lot.

      leon de boer wrote:

      much faster than a database connection

      .. to implement, sure. But certainly not to execute. ;)

      GOTOs are a bit like wire coat hangers: they tend to breed in the darkness, such that where there once were few, eventually there are many, and the program's architecture collapses beneath them. (Fran Poretto)
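
For what it's worth, a minimal sketch of the reading side of that database idea, assuming SQLite; the database name, the lines(file, line) table and the schema are hypothetical, not anything from the thread. The writer would INSERT each line as it is produced, and the reader then needs one query:

    // Sketch only: querying pre-stored lines from SQLite instead of re-scanning ~1800 files.
    // Database name, table and columns are assumptions.
    #include <sqlite3.h>
    #include <stdio.h>

    static int PrintMatchingLines(const char *prefix)    /* e.g. "ABCD" */
    {
        sqlite3 *db;
        sqlite3_stmt *stmt;
        if (sqlite3_open("states.db", &db) != SQLITE_OK) return -1;

        /* SQLite's LIKE is case-insensitive for ASCII, matching the _strnicmp comparison. */
        const char *sql = "SELECT file, line FROM lines WHERE line LIKE ? || '%';";
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) { sqlite3_close(db); return -1; }
        sqlite3_bind_text(stmt, 1, prefix, -1, SQLITE_TRANSIENT);

        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("%s: %s\n", (const char *)sqlite3_column_text(stmt, 0),
                               (const char *)sqlite3_column_text(stmt, 1));

        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }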

cristiapi wrote:

I have about 1800 text files in a dir and their number is slowly growing (several files a day). The size of a file is about 40 kB. I need to enumerate all the files with the ".states" extension, open each file, check whether there is a line that starts with a particular sequence of 4 chars, and if the line exists, save that line in a std::vector. I use the following code, but it takes a very long time on the first run (subsequent runs are very fast):

    struct _finddata_t fd;
    intptr_t hFile;                                        // _findfirst returns intptr_t
    if((hFile = _findfirst("*.states", &fd)) == -1L) return;   // no file found
    do {
        FILE *f = fopen(fd.name, "rS");                    // "S" = sequential-access hint (MSVC)
        if(!f) continue;
        while(fgets(buf, sizeof(buf), f)) {
            if(0 == _strnicmp(buf, "ABCD", 4)) {
                // save buf in a std::vector
                break;
            }
        }
        fclose(f);
    } while(_findnext(hFile, &fd) == 0);
    _findclose(hFile);

Is there any way to speed up the code? If I merge all the files into a single file, I solve the problem, but I prefer to keep all the original files.

Joe Woodbury wrote (#21):

The main culprit is fgets. Once you call that, the fopen family of calls immediately loads, I believe, 32k of data. On top of that, fgets is relatively slow. For speed, you may be better off using fread, reading 4k (or the page size) at a time and parsing the block yourself by simply looking for ABCD. This could be sped up further with a Boyer-Moore search, though since the string is short, simply scanning for 'A' and then checking the rest may be faster. That said, I believe some newer implementations of the standard library now include a Boyer-Moore searcher.

Do also note that caching plays a big part here. Just recursing folders will take significantly longer on the first pass than the second. This can be deceptive, however, since in actual operation those caches may be flushed between program runs.
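
For anyone following along, a rough sketch of that block-read approach. The helper name and buffer size are my own; it only detects whether a file contains the string, and it carries a 3-byte tail between blocks so a match straddling a block boundary isn't missed:

    // Sketch of the block-read idea above: fread 4 KB at a time and scan the block
    // directly instead of calling fgets per line.
    #include <stdio.h>
    #include <string.h>

    static int FileContainsABCD(const char *name)
    {
        char block[4096 + 3];
        size_t carry = 0, got;
        FILE *f = fopen(name, "rbS");                      // "S" = sequential-access hint (MSVC extension)
        if (!f) return 0;

        while ((got = fread(block + carry, 1, sizeof(block) - carry, f)) > 0) {
            size_t len = carry + got;
            for (size_t i = 0; i + 4 <= len; ++i)          // cheap first-character test, then the rest
                if ((block[i] == 'A' || block[i] == 'a') && 0 == _strnicmp(block + i, "ABCD", 4)) {
                    fclose(f);
                    return 1;
                }
            carry = (len < 3) ? len : 3;                   // keep a 3-byte tail so a straddling match isn't lost
            memmove(block, block + len - carry, carry);
        }
        fclose(f);
        return 0;
    }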


cristiapi wrote (#22):

          Stefan_Lang wrote:

          as I understand it the main task is not to find files that contain the string, but find all lines within these files containing it.

Each file usually contains 190 to 220 lines, and a file may or may not contain the wanted line (but almost all the files do). If the wanted line is in the file, there is only one such line.

jeron1 wrote:

            Maybe read the whole file at once as opposed to many fgets() calls, then do your string search in RAM?

            "the debugger doesn't tell me anything because this code compiles just fine" - random QA comment "Facebook is where you tell lies to your friends. Twitter is where you tell the truth to strangers." - chriselst "I don't drink any more... then again, I don't drink any less." - Mike Mullikins uncle

cristiapi wrote (#23):

            I tried that method, but nothing changes. Any kind of file opening (including MapViewOfFile) terribly slows the process down.
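
For reference, a file-mapping scan of the kind mentioned (MapViewOfFile) might look like the sketch below; as the post says, it doesn't help here, because the cost is the cold-cache disk I/O rather than the read API. The helper name is made up and error handling is trimmed:

    // Sketch of scanning one file through a memory mapping (Win32). Looks for a line
    // starting with "ABCD"; helper name is hypothetical.
    #include <windows.h>
    #include <string.h>

    static BOOL ScanMapped(const char *name)
    {
        BOOL found = FALSE;
        HANDLE hFile = CreateFileA(name, GENERIC_READ, FILE_SHARE_READ, NULL,
                                   OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        if (hFile == INVALID_HANDLE_VALUE) return FALSE;

        HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);   // fails on empty files
        if (hMap) {
            const char *base = (const char *)MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
            if (base) {
                DWORD size = GetFileSize(hFile, NULL);     // files are ~40 kB, so the low DWORD is enough
                for (DWORD i = 0; i + 4 <= size; ++i)
                    if ((i == 0 || base[i - 1] == '\n') && 0 == _strnicmp(base + i, "ABCD", 4)) {
                        found = TRUE;
                        break;
                    }
                UnmapViewOfFile(base);
            }
            CloseHandle(hMap);
        }
        CloseHandle(hFile);
        return found;
    }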

leon de boer wrote:

The more obvious answer is get whatever is saving the files to put the ones which contain your string ABCD out under a special name string. Then you don't have to search inside the file at all to find the files you want. Another obvious choice is have the files on a ramdisk as there isn't much data. The whole process seems a bit backward to me: you are working on the reading code, not the writing code.

cristiapi wrote (#24):

leon de boer wrote:

The more obvious answer is get whatever is saving the files to put the ones which contain your string ABCD out under a special name string. Then you don't have to search inside the file at all to find the files you want.

What happens if I need to find "BCDE" or "S1 " or "01FA" or ...?

leon de boer wrote:

Another obvious choice is have the files on a ramdisk as there isn't much data.

If I put the folder on the SSD, the process is much faster: 60.9 s for the HD and 2.9 s for the SSD. The SSD is an "unusual" location for that folder because all the other files are on the HD, but it's the easiest solution. Thank you.


leon de boer wrote (#25):

Quote:

What happens if I need to find "BCDE" or "S1 " or "01FA" or ...?

Label them differently with a special name, obviously; all you are doing is coming up with a file-naming convention :-) Hell, use the file extension you already have (*.states) and mask its bits for what special strings are in it:

*.states  = file with no special tags
*.states1 = file with special tag 1 in it
*.states2 = file with special tag 2 in it
*.states3 = file with special tags 1 & 2 in it
*.states4 = file with special tag 3 in it
*.states5 = file with special tags 1 & 3 in it
*.states6 = file with special tags 2 & 3 in it
*.states7 = file with special tags 1, 2 & 3 in it

You can know what tags are in the file without ever opening it; all you need to know is the filename. This is also obviously a Windows program; why aren't you using the Windows API for the file open and reading?

    HANDLE Handle = CreateFile(fd.name, GENERIC_READ, FILE_SHARE_READ,
                               0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);   // open the file
    if (Handle != INVALID_HANDLE_VALUE)
    {
        DWORD Actual;
        ReadFile(Handle, &buf[0], sizeof(buf)-1, &Actual, 0);   // read up to buffer-1 bytes (leave 1 byte for the terminating 0)
        if (Actual > 0)
        {
            buf[Actual] = 0;                                    // make sure it is zero-terminated for the next string op
            if (0 == _strnicmp(buf, "ABCD", 4)) {
                // save buf in a std::vector
            }
        }
        CloseHandle(Handle);
    }

                In vino veritas
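
A tiny sketch of how that extension convention could be driven from code; the helper names and the bit assignment are made up, purely to illustrate the idea:

    // Sketch: encode which special tags a file contains as a bit mask appended to ".states".
    // Bit 0 = tag 1, bit 1 = tag 2, bit 2 = tag 3, so ".states5" means tags 1 & 3.
    #include <cstdlib>
    #include <string>

    std::string StatesExtension(unsigned tagMask)
    {
        std::string ext = ".states";
        if (tagMask != 0)
            ext += std::to_string(tagMask);                // ".states3" = tags 1 & 2, ".states7" = all three
        return ext;
    }

    bool HasTag(const std::string &filename, int tagNumber)    // tagNumber is 1-based
    {
        std::string::size_type pos = filename.rfind(".states");
        if (pos == std::string::npos) return false;
        unsigned mask = (unsigned)std::atoi(filename.c_str() + pos + 7);   // digits after ".states", 0 if none
        return (mask & (1u << (tagNumber - 1))) != 0;
    }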


cristiapi wrote (#26):

Since almost all the files contain the wanted string, I'll need to open almost all the files, so the speed-up would be negligible. I don't use the Win API because, AFAIK, there is no fgets() equivalent, and there is no speed-up if I read the whole file at once.


leon de boer wrote (#27):

I gave you the fgets equivalent above (it's only a couple of lines of code). I am not convinced it isn't faster, because with the standard library your opens and reads go through the C runtime's file handling layer. Anyhow, I will leave you to it.

                    In vino veritas


cristiapi wrote (#28):

The definitive solution: Ctrl-X (move) the folder from the HDD to the SSD, restart the PC (probably not needed), then Ctrl-X it back from the SSD to the HDD. Now the process takes 8.2 s instead of 61 s, which seems reasonable to me.
