MFC Program and Very Large Text Files

Posted in C / C++ / MFC (11 posts, 5 posters)
Andy202
#1

For merging together some recording data stored in CSV files, I have used the class CTextFile (by Johan Rosengren, via the CodeProject site) to read in and write out the processed files. I also used the class CNewStringArray (a modified version of CStringArray by Anders M Eriksson, again via CodeProject). I use the MFC function AfxExtractSubString() to extract the various fields of each CSV record, and CNewStringArray variables to hold and process the data. Now I have a problem: some of the files to be processed may be on the order of 250 to 700 GB. I have never used files larger than about 10 MB. Will I have problems with files of these sizes, and do I need to consider new methods of processing such very large files? Any comments and advice, please.
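
As a concrete point of reference, here is a minimal sketch of the kind of field extraction described above. Only AfxExtractSubString() itself is the real MFC call; the three-field record layout and all the names are illustrative assumptions, not code from this thread.

#include <afxwin.h>   // CString, AfxExtractSubString, TRACE

// Split one CSV record into its fields. AfxExtractSubString() returns
// FALSE once the requested field index is past the last separator, so
// the loop terminates by itself. Note: this does not handle quoted
// fields that contain embedded commas.
void ParseRecord(const CString& strLine)
{
    CString strField;
    for (int i = 0; AfxExtractSubString(strField, strLine, i, _T(',')); i++)
    {
        // Hypothetical layout: field 0 = timestamp, 1 = channel, 2 = value.
        TRACE(_T("field %d: %s\n"), i, (LPCTSTR)strField);
    }
}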

Maximilien
#2 (in reply to Andy202, #1)

I'm not an expert on this topic, but I think you should look at memory-mapped files. They let you navigate the file without having to load the whole file into memory.

      Watched code never compiles.

Chandrasekharan P
#3 (in reply to Andy202, #1)

250 to 700 GB of data, or is that a typo? If it really is 250 to 700 GB of data, where is it stored?

federico strati
#4 (in reply to Maximilien, #2)

Yes, you are definitely better off using memory-mapped files; namely, you should use CreateFileMapping, MapViewOfFile, UnmapViewOfFile and the like. It is a bit dated, but Jeffrey Richter's book "Programming Applications for Microsoft Windows" has a good introduction to this API. Searching MSDN with these pointers will lead you to the correct API for your version of the operating system. Cheers, Federico

Update: it is still the same API in the latest versions of Windows; I just checked on MSDN: http://msdn.microsoft.com/en-us/library/aa366537(v=VS.85).aspx

          modified on Thursday, May 19, 2011 7:51 AM

jschell
#5 (in reply to Andy202, #1)

            Andy202 wrote:

For merging together some recording data stored in CSV files, I have used the class CTextFile (by Johan Rosengren, via the CodeProject site) to read in and write out the processed files.

Presumably you are doing something more than just creating one file out of two, so your design for much larger files needs to take into account specifically what you need to do. If it were me, I would also look into the business requirement that says the output must be another file: it is pretty pointless to parse a large file only to produce another file that must be parsed again.

Andy202
#6 (in reply to federico strati, #4)

Thanks, Federico, for your post. I have used these APIs before, but only for auto-generated data structures of around 50 KB. I did look at the link you gave, and the following passage concerns me:

"If the file mapping object is backed by the operating system paging file (the hFile parameter is INVALID_HANDLE_VALUE), specifies that when a view of the file is mapped into a process address space, the entire range of pages is committed rather than reserved. The system must have enough committable pages to hold the entire mapping. Otherwise, CreateFileMapping fails."

With file sizes of ~500 GB, will these APIs work? Andy.

Andy202
#7 (in reply to jschell, #5)

Thanks for your post, jschell. Two input files are used to generate one output file. Input1 is sampled at 50 ms and Input2 at 200 ms, so using the timing information I merge the two sets of data into one output file, interpolating as required. I can do what is required when the files are small, for example Input1 = 1 MB, Input2 = 2 MB and Output = 5 MB. Andy
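
The interpolation step can be stated compactly. Below is a minimal sketch under the assumption that each record reduces to a (time, value) pair; the helper and its argument names are hypothetical, not code from this thread.

// Value of the slower (200 ms) channel at time t, given the two
// bracketing samples (t0, v0) and (t1, v1) with t0 <= t <= t1.
double InterpolateAt(double t, double t0, double v0,
                               double t1, double v1)
{
    if (t1 == t0)      // duplicate timestamps: avoid division by zero
        return v0;
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0);
}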

jschell
#8 (in reply to Andy202, #7)

                  Andy202 wrote:

Input1 is sampled at 50 ms and Input2 at 200 ms, so using the timing information I merge the two sets of data into one output file, interpolating as required.

So you have one 'block' (perhaps one line, or longer) from file 1 that exists every 50 ms; call this X. You have another 'block' from file 2 that exists every 200 ms; call this Y. Thus you have 4 (200 ms / 50 ms) X entries before each Y entry.

Steps:
1. Build a buffered reader for each file. Buffered in this case means it reads N 'block' entries and allows reading M more on request. The timestamp is exposed (parsed from the block).
2. Read X via its buffer, where N is 200.
3. Read Y via its buffer, where N is also 200 (it could be less as well).
4. Now the buffer from step 2 will have data that fits into the buffer from step 3, because you have read enough to overlap.
5. Step 4 represents a starting point; basically the two buffers will mostly be offset by 4.
6. You can't assume the offset will always be 4, so continue to compare the two timestamps as you read.
7. On start-up you need to sync the two buffer reads, since one file might have a much different start time than the other.

Performance impacts:
- Play with the stream buffer sizes (the actual file reads versus the buffered readers above).
- Profile it with a tool on some large, real files, say 50 MB at least.
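
A minimal sketch of the timestamp-driven merge loop described above, in MFC-flavoured C++. The BlockReader type, the assumption that the timestamp is the first CSV field, and all names are illustrative; the look-ahead bookkeeping of steps 2 to 5 and the interpolation are left out, and only the timestamp comparison of step 6 is shown.

#include <afxwin.h>   // CStdioFile, CString, AfxExtractSubString
#include <tchar.h>    // _tstof

// One parsed 'block': a CSV line plus the timestamp extracted from it.
struct Block { double timeSec; CString line; };

// Minimal forward-only reader; the ctor throws CFileException on failure.
class BlockReader
{
    CStdioFile m_file;
    Block      m_cur;
    BOOL       m_ok;
public:
    explicit BlockReader(LPCTSTR path)
        : m_file(path, CFile::modeRead | CFile::typeText) { Advance(); }
    BOOL HasBlock() const { return m_ok; }
    const Block& Current() const { return m_cur; }
    void Advance()
    {
        CString strLine, strTime;
        m_ok = m_file.ReadString(strLine);
        if (m_ok)
        {
            AfxExtractSubString(strTime, strLine, 0, _T(','));  // field 0 = time
            m_cur.timeSec = _tstof(strTime);
            m_cur.line = strLine + _T("\n");   // ReadString strips the newline
        }
    }
};

// Merge the 50 ms and 200 ms streams by comparing timestamps on every
// iteration (step 6: never rely on the nominal 4:1 ratio).
void Merge(BlockReader& fast, BlockReader& slow, CStdioFile& out)
{
    while (fast.HasBlock() && slow.HasBlock())
    {
        BlockReader& next =
            (fast.Current().timeSec <= slow.Current().timeSec) ? fast : slow;
        out.WriteString(next.Current().line);
        next.Advance();
    }
    // Drain whichever input still has data.
    while (fast.HasBlock()) { out.WriteString(fast.Current().line); fast.Advance(); }
    while (slow.HasBlock()) { out.WriteString(slow.Current().line); slow.Advance(); }
}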

federico strati
#9 (in reply to Andy202, #6)

Hello Andy, what follows is extracted from Jeffrey Richter's book "Programming Applications for Microsoft Windows". It was written for a 32-bit OS, so be careful to adapt it if you work on Win 7.

--- start ---

Processing a Big File Using Memory-Mapped Files

In an earlier section, I said I would tell you how to map a 16-EB file into a small address space. Well, you can't. Instead, you must map a view of the file that contains only a small portion of the file's data. You should start by mapping a view of the very beginning of the file. When you've finished accessing the first view of the file, you can unmap it and then map a new view starting at an offset deeper within the file. You'll need to repeat this process until you access the complete file. This certainly makes dealing with large memory-mapped files less convenient, but fortunately most files are small enough that this problem doesn't usually come up. Let's look at an example using an 8-GB file and a 32-bit address space. Here is a routine that counts all the 0 bytes in a binary data file in several steps:

__int64 Count0s(void) {

    // Views must always start on a multiple
    // of the allocation granularity.
    SYSTEM_INFO sinf;
    GetSystemInfo(&sinf);

    // Open the data file.
    HANDLE hFile = CreateFile("C:\\HugeFile.Big", GENERIC_READ,
        FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);

    // Create the file-mapping object.
    HANDLE hFileMapping = CreateFileMapping(hFile, NULL,
        PAGE_READONLY, 0, 0, NULL);

    DWORD dwFileSizeHigh;
    __int64 qwFileSize = GetFileSize(hFile, &dwFileSizeHigh);
    qwFileSize += (((__int64) dwFileSizeHigh) << 32);

    // We no longer need access to the file object's handle.
    CloseHandle(hFile);

    __int64 qwFileOffset = 0, qwNumOf0s = 0;
    while (qwFileSize > 0) {

        // Determine the number of bytes to be mapped in this view.
        DWORD dwBytesInBlock = sinf.dwAllocationGranularity;
        if (qwFileSize < sinf.dwAllocationGranularity)
            dwBytesInBlock = (DWORD) qwFileSize;

        PBYTE pbFile = (PBYTE) MapViewOfFile(hFileMapping, FILE_MAP_READ,
            (DWORD) (qwFileOffset >> 32),         // Starting byte
            (DWORD) (qwFileOffset & 0xFFFFFFFF),  // in file
            dwBytesInBlock);                      // # of bytes to map

        // Count the number of 0s in this block.
        for (DWORD dwByte = 0; dwByte < dwBytesInBlock; dwByte++) {
            if (pbFile[dwByte] == 0)
                qwNumOf0s++;
        }

        // Unmap the view; we don't want multiple views
        // in our address space.
        UnmapViewOfFile(pbFile);

        // Skip to the next set of bytes in the file.
        qwFileOffset += dwBytesInBlock;
        qwFileSize -= dwBytesInBlock;
    }

    CloseHandle(hFileMapping);
    return (qwNumOf0s);
}

--- end ---

Andy202
#10 (in reply to federico strati, #9)

Thanks, Federico, for the information. The requirement has gone away, but I thought it would be good to work through this task as an example, should I ever need to revisit the problem. Just one follow-up question: you suggest 64 KB (the allocation granularity size). Is that from experience, and is it the best value? Andy

federico strati
#11 (in reply to Andy202, #10)

                        Hi, the allocation granularity size, if you look at the code, is system dependent:

                        SYSTEM_INFO sinf;
                        GetSystemInfo(&sinf);

                        [... snipped ...]

                        // Determine the number of bytes to be mapped in this view
                        DWORD dwBytesInBlock = sinf.dwAllocationGranularity;

I don't know the values for recent Windows versions; I cited 64 KB just to give a figure, and you can get the real value from the API (GetSystemInfo). As for the best value, you should map in multiples of the allocation granularity, and other considerations (such as the memory available to the system and/or to the single process) come into play. You may have to experiment a bit to find what is best for your requirements. Cheers, Federico
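
For concreteness, a small sketch of the rounding this implies; the 16x multiple below is an arbitrary illustration, not a recommendation from this thread.

#include <windows.h>

// MapViewOfFile() requires the file offset of a view to be a multiple of
// the allocation granularity. Round an arbitrary offset down to the
// nearest boundary, and pick a view size that is a multiple of it.
__int64 AlignOffset(__int64 qwOffset, DWORD* pdwViewBytes)
{
    SYSTEM_INFO sinf;
    GetSystemInfo(&sinf);
    DWORD dwGran = sinf.dwAllocationGranularity;    // typically 64 KB
    if (pdwViewBytes != NULL)
        *pdwViewBytes = 16 * dwGran;   // e.g. 1 MB views on a 64 KB system
    return qwOffset - (qwOffset % dwGran);
}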
