MFC Program and Very Large Text Files

Posted in C / C++ / MFC (11 posts, 5 posters)
Andy202
#1

For merging together some recording data stored in CSV files, I have used the class CTextFile (by Johan Rosengren, via the CodeProject site) to read in and write out the processed files. I also used the class CNewStringArray (a modified version of CStringArray by Anders M Eriksson, again via CodeProject). I use the MFC function AfxExtractSubString() to extract the various fields of each CSV record, and CNewStringArray variables to hold and process the data. Now I have a problem: some of the files to be processed may be on the order of 250 to 700 GB. I have never used files larger than about 10 MB. Will I have problems with files of these sizes, and do I need to consider new methods of processing such very large files? Any comments and advice, please.
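
As a concrete point of reference, here is a minimal sketch of the kind of field extraction described above. Only AfxExtractSubString() itself is the real MFC call; the three-field record layout and all the names are illustrative assumptions, not code from this thread.

#include <afxwin.h>   // CString, AfxExtractSubString, TRACE

// Split one CSV record into its fields. AfxExtractSubString() returns
// FALSE once the requested field index is past the last separator, so
// the loop terminates by itself. Note: this does not handle quoted
// fields that contain embedded commas.
void ParseRecord(const CString& strLine)
{
    CString strField;
    for (int i = 0; AfxExtractSubString(strField, strLine, i, _T(',')); i++)
    {
        // Hypothetical layout: field 0 = timestamp, 1 = channel, 2 = value.
        TRACE(_T("field %d: %s\n"), i, (LPCTSTR)strField);
    }
}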

Maximilien
#2 (in reply to Andy202, #1)

I'm not an expert on this topic, but I think you should look at memory-mapped files. They let you navigate the file without having to load the whole file into memory.

      Watched code never compiles.

Chandrasekharan P
#3 (in reply to Andy202, #1)

250 to 700 GB of data, or is that a typo? If it really is 250 to 700 GB of data, where is it stored?

federico strati
#4 (in reply to Maximilien, #2)

Yes, you are definitely better off using memory-mapped files; namely, you should use CreateFileMapping, MapViewOfFile, UnmapViewOfFile and the like. It is a bit dated, but Jeffrey Richter's book "Programming Applications for Microsoft Windows" has a good introduction to this API. Searching MSDN with these pointers will lead you to the correct API for your version of the operating system. Cheers, Federico

Update: it is still the same API in the latest versions of Windows; I just checked on MSDN: http://msdn.microsoft.com/en-us/library/aa366537(v=VS.85).aspx

          modified on Thursday, May 19, 2011 7:51 AM

jschell
#5 (in reply to Andy202, #1)

            Andy202 wrote:

For merging together some recording data stored in CSV files, I have used the class CTextFile (by Johan Rosengren, via the CodeProject site) to read in and write out the processed files.

Presumably you are doing something more than just creating one file out of two, so your design for much larger files needs to take into account specifically what you need to do. If it were me, I would also look into the business requirement that says the output must be another file: it is pretty pointless to parse a large file only to produce another file that must be parsed again.

Andy202
#6 (in reply to federico strati, #4)

Thanks, Federico, for your post. I have used these APIs before, but only for auto-generated data structures of around 50 KB. I did look at the link you gave, and the following passage concerns me:

"If the file mapping object is backed by the operating system paging file (the hFile parameter is INVALID_HANDLE_VALUE), specifies that when a view of the file is mapped into a process address space, the entire range of pages is committed rather than reserved. The system must have enough committable pages to hold the entire mapping. Otherwise, CreateFileMapping fails."

With file sizes of ~500 GB, will these APIs work? Andy.

Andy202
#7 (in reply to jschell, #5)

Thanks for your post, jschell. Two input files are used to generate one output file. Input1 is sampled at 50 ms and Input2 at 200 ms, so using the timing information I merge the two sets of data into one output file, interpolating as required. I can do what is required when the files are small, for example Input1 = 1 MB, Input2 = 2 MB and Output = 5 MB. Andy
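
The interpolation step can be stated compactly. Below is a minimal sketch under the assumption that each record reduces to a (time, value) pair; the helper and its argument names are hypothetical, not code from this thread.

// Value of the slower (200 ms) channel at time t, given the two
// bracketing samples (t0, v0) and (t1, v1) with t0 <= t <= t1.
double InterpolateAt(double t, double t0, double v0,
                               double t1, double v1)
{
    if (t1 == t0)      // duplicate timestamps: avoid division by zero
        return v0;
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0);
}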

jschell
#8 (in reply to Andy202, #7)

                  Andy202 wrote:

Input1 is sampled at 50 ms and Input2 at 200 ms, so using the timing information I merge the two sets of data into one output file, interpolating as required.

So you have one 'block' (perhaps one line, or longer) from file 1 that exists every 50 ms; call this X. You have another 'block' from file 2 that exists every 200 ms; call this Y. Thus you have 4 (200 ms / 50 ms) X entries before each Y entry.

Steps:
1. Build a buffered reader for each file. Buffered in this case means it reads N 'block' entries and allows reading M more on request. The timestamp is exposed (parsed from the block).
2. Read X via its buffer, where N is 200.
3. Read Y via its buffer, where N is also 200 (it could be less as well).
4. Now the buffer from step 2 will have data that fits into the buffer from step 3, because you have read enough to overlap.
5. Step 4 represents a starting point; basically the two buffers will mostly be offset by 4.
6. You can't assume the offset will always be 4, so continue to compare the two timestamps as you read.
7. On start-up you need to sync the two buffer reads, since one file might have a much different start time than the other.

Performance impacts:
- Play with the stream buffer sizes (the actual file reads versus the buffered readers above).
- Profile it with a tool on some large, real files, say 50 MB at least.
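
A minimal sketch of the timestamp-driven merge loop described above, in MFC-flavoured C++. The BlockReader type, the assumption that the timestamp is the first CSV field, and all names are illustrative; the look-ahead bookkeeping of steps 2 to 5 and the interpolation are left out, and only the timestamp comparison of step 6 is shown.

#include <afxwin.h>   // CStdioFile, CString, AfxExtractSubString
#include <tchar.h>    // _tstof

// One parsed 'block': a CSV line plus the timestamp extracted from it.
struct Block { double timeSec; CString line; };

// Minimal forward-only reader; the ctor throws CFileException on failure.
class BlockReader
{
    CStdioFile m_file;
    Block      m_cur;
    BOOL       m_ok;
public:
    explicit BlockReader(LPCTSTR path)
        : m_file(path, CFile::modeRead | CFile::typeText) { Advance(); }
    BOOL HasBlock() const { return m_ok; }
    const Block& Current() const { return m_cur; }
    void Advance()
    {
        CString strLine, strTime;
        m_ok = m_file.ReadString(strLine);
        if (m_ok)
        {
            AfxExtractSubString(strTime, strLine, 0, _T(','));  // field 0 = time
            m_cur.timeSec = _tstof(strTime);
            m_cur.line = strLine + _T("\n");   // ReadString strips the newline
        }
    }
};

// Merge the 50 ms and 200 ms streams by comparing timestamps on every
// iteration (step 6: never rely on the nominal 4:1 ratio).
void Merge(BlockReader& fast, BlockReader& slow, CStdioFile& out)
{
    while (fast.HasBlock() && slow.HasBlock())
    {
        BlockReader& next =
            (fast.Current().timeSec <= slow.Current().timeSec) ? fast : slow;
        out.WriteString(next.Current().line);
        next.Advance();
    }
    // Drain whichever input still has data.
    while (fast.HasBlock()) { out.WriteString(fast.Current().line); fast.Advance(); }
    while (slow.HasBlock()) { out.WriteString(slow.Current().line); slow.Advance(); }
}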

federico strati
#9 (in reply to Andy202, #6)

Hello Andy, what follows is extracted from Jeffrey Richter's book "Programming Applications for Microsoft Windows". It was written for a 32-bit OS, so be careful to adapt it if you work on Win 7.

--- start ---

Processing a Big File Using Memory-Mapped Files

In an earlier section, I said I would tell you how to map a 16-EB file into a small address space. Well, you can't. Instead, you must map a view of the file that contains only a small portion of the file's data. You should start by mapping a view of the very beginning of the file. When you've finished accessing the first view of the file, you can unmap it and then map a new view starting at an offset deeper within the file. You'll need to repeat this process until you access the complete file. This certainly makes dealing with large memory-mapped files less convenient, but fortunately most files are small enough that this problem doesn't usually come up. Let's look at an example using an 8-GB file and a 32-bit address space. Here is a routine that counts all the 0 bytes in a binary data file in several steps:

__int64 Count0s(void) {

    // Views must always start on a multiple
    // of the allocation granularity.
    SYSTEM_INFO sinf;
    GetSystemInfo(&sinf);

    // Open the data file.
    HANDLE hFile = CreateFile("C:\\HugeFile.Big", GENERIC_READ,
        FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);

    // Create the file-mapping object.
    HANDLE hFileMapping = CreateFileMapping(hFile, NULL,
        PAGE_READONLY, 0, 0, NULL);

    DWORD dwFileSizeHigh;
    __int64 qwFileSize = GetFileSize(hFile, &dwFileSizeHigh);
    qwFileSize += (((__int64) dwFileSizeHigh) << 32);

    // We no longer need access to the file object's handle.
    CloseHandle(hFile);

    __int64 qwFileOffset = 0, qwNumOf0s = 0;
    while (qwFileSize > 0) {

        // Determine the number of bytes to be mapped in this view.
        DWORD dwBytesInBlock = sinf.dwAllocationGranularity;
        if (qwFileSize < sinf.dwAllocationGranularity)
            dwBytesInBlock = (DWORD) qwFileSize;

        PBYTE pbFile = (PBYTE) MapViewOfFile(hFileMapping, FILE_MAP_READ,
            (DWORD) (qwFileOffset >> 32),         // Starting byte
            (DWORD) (qwFileOffset & 0xFFFFFFFF),  // in file
            dwBytesInBlock);                      // # of bytes to map

        // Count the number of 0s in this block.
        for (DWORD dwByte = 0; dwByte < dwBytesInBlock; dwByte++) {
            if (pbFile[dwByte] == 0)
                qwNumOf0s++;
        }

        // Unmap the view; we don't want multiple views
        // in our address space.
        UnmapViewOfFile(pbFile);

        // Skip to the next set of bytes in the file.
        qwFileOffset += dwBytesInBlock;
        qwFileSize -= dwBytesInBlock;
    }

    CloseHandle(hFileMapping);
    return (qwNumOf0s);
}

--- end ---

Andy202
#10 (in reply to federico strati, #9)

Thanks, Federico, for the information. The requirement has gone away, but I thought it would be good to work through this task as an example, should I ever need to revisit the problem. Just one follow-up question: you suggest 64 KB (the allocation granularity size). Is that from experience, and is it the best value? Andy

federico strati
#11 (in reply to Andy202, #10)

                        Hi, the allocation granularity size, if you look at the code, is system dependent:

                        SYSTEM_INFO sinf;
                        GetSystemInfo(&sinf);

                        [... snipped ...]

                        // Determine the number of bytes to be mapped in this view
                        DWORD dwBytesInBlock = sinf.dwAllocationGranularity;

I don't know the values for recent Windows versions; I cited 64 KB just to give a figure, and you can get the real value from the API (GetSystemInfo). As for the best value, you should map in multiples of the allocation granularity, and other considerations (such as the memory available to the system and/or to the single process) come into play. You may have to experiment a bit to find what is best for your requirements. Cheers, Federico
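
For concreteness, a small sketch of the rounding this implies; the 16x multiple below is an arbitrary illustration, not a recommendation from this thread.

#include <windows.h>

// MapViewOfFile() requires the file offset of a view to be a multiple of
// the allocation granularity. Round an arbitrary offset down to the
// nearest boundary, and pick a view size that is a multiple of it.
__int64 AlignOffset(__int64 qwOffset, DWORD* pdwViewBytes)
{
    SYSTEM_INFO sinf;
    GetSystemInfo(&sinf);
    DWORD dwGran = sinf.dwAllocationGranularity;    // typically 64 KB
    if (pdwViewBytes != NULL)
        *pdwViewBytes = 16 * dwGran;   // e.g. 1 MB views on a 64 KB system
    return qwOffset - (qwOffset % dwGran);
}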
