Extract data out of a LARGE text file

amatbrewer
#1

Before I put a lot of effort into writing something from scratch, I bet someone out there already has most of what I need. I need to extract some data from a log file. Specifically, I need to search a text file for a given string to find the first line of a block of data I want to process, extract some data from that line, then read the next line and use a value in it to know how many more lines to read and process before looking for the next block and doing it all again, until the end of the file is reached. Pretty simple, but why reinvent the wheel? And I bet there are some really cool ways of doing this that I would never think of. While you are at it, any recommendations or warnings on doing this with VERY large text files (>2 GB)?

    David Wilkes

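For reference, the core loop described above needs nothing exotic: a StreamReader hands back one line at a time, so even a multi-gigabyte file never has to fit in memory. Below is a minimal sketch; the marker string "BEGIN BLOCK", the position of the count field, and the two Process helpers are hypothetical placeholders for whatever the real log format requires.

    using System;
    using System.IO;

    class BlockScanner
    {
        static void Main(string[] args)
        {
            // Stream the file line by line; memory use stays flat because
            // nothing is buffered beyond the current line.
            using (StreamReader reader = new StreamReader(args[0]))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // "BEGIN BLOCK" is an assumed marker; substitute the
                    // real search string for the block's first line.
                    if (!line.Contains("BEGIN BLOCK"))
                        continue;

                    ProcessHeader(line); // extract data from the first line

                    // Assumption: the next line carries the number of detail
                    // lines in its second whitespace-separated field.
                    string countLine = reader.ReadLine();
                    if (countLine == null)
                        break;
                    int count = int.Parse(countLine.Split(' ')[1]);

                    for (int i = 0; i < count && (line = reader.ReadLine()) != null; i++)
                        ProcessDetail(line);
                }
            }
        }

        static void ProcessHeader(string line) { /* pull fields out here */ }
        static void ProcessDetail(string line) { /* per-line processing here */ }
    }

Because the scan never seeks backwards, this also sidesteps the limits that whole-file or memory-mapped approaches can hit past 2 GB.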
Luc Pattyn
#2

Hi, I think it is not wise to have log files that large. Why not create a series of normal-sized log files instead? Just start a new file each day, each hour, or whatever is appropriate, and include the date/time in the file name to keep them apart. You can keep them all in one folder, and if you need to transfer them as a single entity, a Zip utility will take care of that (as well as reducing the overall size for you). BTW, Notepad is probably not the optimal answer to your question. :)

      Luc Pattyn

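For what it's worth, the rolling-file idea amounts to little more than stamping the date or hour into each file name so the pieces sort and group naturally in one folder; a trivial sketch, with the "app" prefix as a placeholder:

    using System;

    class LogNaming
    {
        static void Main()
        {
            // One file per hour; yyyyMMdd-HH makes the names sort chronologically.
            string name = string.Format("app-{0:yyyyMMdd-HH}.log", DateTime.Now);
            Console.WriteLine(name); // e.g. app-20071105-09.log
        }
    }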
amatbrewer
#3

Thanks for the advice, but that presents some problems of its own, because other tools also have to make use of these logs. Most of the time I will be processing three log files at a time of around 900 MB each. Needless to say, it takes a while... :zzz: You should see how long it takes to FTP them... every Monday. I could break the logs up into smaller files, but I would still end up processing the entire volume as a whole anyway, as well as changing the setup of the other tools. So I would like to avoid this if I can.

        David Wilkes

Luc Pattyn
#4

Well, you could surely improve things:

1. Before file transfer, try compression; again, a ZIP utility is useful, even for a single file. On text files it typically reduces size by a factor of about 3 to 5.
2. If you can modify the app that does the logging, you could leave everything as is, but add something that creates another file containing exactly (or approximately) what you are really interested in.
3. I don't know what the underlying business logic is, but requiring that amount of text to be collected, transferred, and analyzed seems very strange to me. I would say the overall process deserves reconsidering. :)

          Luc Pattyn

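On point 1, a hedged sketch of the compression step using the GZipStream class built into .NET 2.0 (System.IO.Compression), so no third-party Zip library is needed; the file names are placeholders:

    using System.IO;
    using System.IO.Compression;

    class LogCompressor
    {
        static void Main()
        {
            // Copy the log through a gzip stream in 64 KB chunks.
            using (FileStream source = File.OpenRead("monday.log"))
            using (FileStream target = File.Create("monday.log.gz"))
            using (GZipStream gzip = new GZipStream(target, CompressionMode.Compress))
            {
                byte[] buffer = new byte[64 * 1024];
                int read;
                while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
                    gzip.Write(buffer, 0, read);
            }
        }
    }

Plain-text logs usually compress well, so this alone could cut the Monday FTP transfers to a fraction of their current size.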
amatbrewer
#5

This is a system that I sometimes think Rube Goldberg designed. We are talking about a cellular phone system running on a UNIX platform (messing around on it is not an option). The available tools and interfaces are archaic at best, and my knowledge of UNIX... well, let's say I know just enough to get into trouble. The volume of data, while unfortunate, is still only a third of what is actually collected, and I've reduced it about as far as I can while still maintaining its integrity and validity. So, does it suck? Sure. But one of the reasons I like this job is that it is always a challenge... and this is today's. Tomorrow? Who knows, maybe I'll have to solve cold fusion.

            David Wilkes

spin vector
#6

I've run large log files for new processes, full throttle at Debug level, and unknown or old code may produce large log files too, so size is not the problem here. What I've done is read the articles on turning a log file into an XML file or a database (after XML). There are good articles on using .NET regex to parse a file and put it into XML. An overview article, a bit light on details, is http://msdn2.microsoft.com/en-us/library/ms972965.aspx. This way you can rationalize the files, normalize them into a database, and then really look at the contents. Good luck.

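A rough sketch of that regex-to-XML idea, again streaming line by line so the 900 MB files are never loaded whole; the pattern (timestamp, level, message) and the element names are invented for illustration:

    using System.IO;
    using System.Text.RegularExpressions;
    using System.Xml;

    class LogToXml
    {
        static void Main()
        {
            // Hypothetical line shape: "<timestamp> <level> <message>".
            Regex pattern = new Regex(@"^(?<time>\S+)\s+(?<level>\w+)\s+(?<msg>.*)$");

            using (StreamReader reader = new StreamReader("input.log"))
            using (XmlTextWriter writer = new XmlTextWriter("output.xml", null))
            {
                writer.Formatting = Formatting.Indented;
                writer.WriteStartElement("log");

                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    Match m = pattern.Match(line);
                    if (!m.Success)
                        continue; // skip lines that don't fit the pattern

                    writer.WriteStartElement("entry");
                    writer.WriteAttributeString("time", m.Groups["time"].Value);
                    writer.WriteAttributeString("level", m.Groups["level"].Value);
                    writer.WriteString(m.Groups["msg"].Value);
                    writer.WriteEndElement();
                }

                writer.WriteEndElement(); // </log>
            }
        }
    }

From there the XML can be bulk-loaded into a database and queried properly.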
spin vector
#7

Also, rereading your original post, you are dealing with some grammatical structure here. Read the articles on Yacc, or for .NET, anything that is a Yacc-like parser generator. The article at http://www.codeproject.com/csharp/minossecc.asp seems helpful. Search around; I can't put my finger on it, but there are other non-Java .NET parsing meta-languages to help with the file structure. There's always MKS Lex and Yacc, which I used a long time ago to great effect; it's not too hard to learn (days). But find something free if possible. Cheers.

amatbrewer
#8

                  Finally got a chance to look at the link you provided. This looks like it will work. I never expected I’d be able to simply locate the desired data based upon its pattern, but I was able to write a RegEx using Expresso that does it (at least on a small sample log). Now all I need to do is code it and see if it will work for the big files. Thanks!

                  David Wilkes

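One caution worth sketching for the big files: feed the expression one line (or one block) at a time rather than handing the whole file to Regex.Matches, and consider RegexOptions.Compiled when the same pattern will run millions of times. "yourPattern" below stands in for the Expresso-built expression:

    using System.IO;
    using System.Text.RegularExpressions;

    class BigFileMatcher
    {
        static void Main()
        {
            // Compiled regexes cost more to create but match faster,
            // which pays off over millions of lines.
            Regex rx = new Regex("yourPattern", RegexOptions.Compiled);

            using (StreamReader reader = new StreamReader("huge.log"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    Match m = rx.Match(line);
                    if (m.Success)
                    {
                        // handle the captured groups here
                    }
                }
            }
        }
    }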