How to retrieve non fixed-length records from a binary file

htres

Hello, I'm new to programming and C#, so please bear with my ignorance! I need to extract jpeg images and header data from a binary file. The binary file is formatted with several fixed length fields containing information about the jpeg image, followed by the jpeg itself, followed by more header data, another jpeg, etc... Using a FileStream and BinaryReader I am able to read and store the metadata, because I know the length of the fields, but I am stumped on how to read and store the jpeg bytes since they vary in size. There is a fixed record delimiter between each header data/jpeg record, so I was thinking of using that to break apart the records. Once they are separated and the header fields read, I could just assume the rest is the jpeg and store that. I'm not sure how to go about doing that though. Any suggestions or demo code would be greatly appreciated! Thanks!

Luc Pattyn

Hi, assuming you are in control of the file format, this is what I suggest: prefix each record with a byte indicating the record type, and terminate the file with yet another record type (I suggest a zero byte here). So a file would look like this:

type0 record0 type1 record1 type2 record2 ... typeN recordN 0

Now each record could correspond to a C# struct, and that struct could contain a Save(stream) method to append the struct as a record to the file stream (don't forget the type byte!), and a Load(stream) method to create and populate a struct by reading from the file stream (starting after the type byte).

Now loading the file stream would consist of a loop containing:

- read the byte that tells the record type
- use a switch to call the right struct's Load method
- if end-code, close stream

If you can't follow the above scheme (e.g. because the file format has been fixed and does not include a type byte), then you need to determine the type of the next record by reading and analyzing some bytes, then rewind a bit (using the Seek method or the Position property) and Load a record; repeat until done.

Hope this helps. :)

DavidAtAscent

If you have control over the binary file format, this may help. Make sure that one of the attributes in the fixed length header is the length of the following JPEG image data. You could then create a byte array (byte[]) of the length of the JPEG image and read the specified number of bytes to the byte array using: BinaryReader.Read(byte[], int index, int length)
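A minimal sketch of that idea, assuming a hypothetical layout in which the last header field is a 4-byte little-endian length followed directly by the JPEG bytes. The class and method names are made up for illustration. Note that BinaryReader.Read may return fewer bytes than requested, so the sketch loops until the array is full:

```csharp
using System;
using System.IO;

static class LengthPrefixedReader
{
    // Assumption: the reader is positioned on a 4-byte little-endian length
    // field, immediately followed by that many bytes of JPEG data.
    public static byte[] ReadImage(BinaryReader reader)
    {
        int length = reader.ReadInt32();
        if (length < 0)
            throw new InvalidDataException("Negative image length.");

        byte[] image = new byte[length];
        int offset = 0;
        while (offset < length)
        {
            // Read returns the number of bytes actually read (0 at end of stream).
            int read = reader.Read(image, offset, length - offset);
            if (read == 0)
                throw new EndOfStreamException("File ended inside the image data.");
            offset += read;
        }
        return image;
    }
}
```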

htres

Thanks for the suggestions! Unfortunately, I am receiving this binary file from a third party and its format is out of my control. :( The fields in the records seem to be separated by a zero byte, and the records are separated by a 16 byte string. Also, I noticed that the jpeg data starts with the bytes (in HEX) FF, D8, FF, E0 and ends with FF, D9. Could I possibly use these byte sequences to identify the jpeg? Also, I'm very new to this so all I have figured out how to do so far in code is to read my fixed length fields like so:

    FileStream fs = File.OpenRead(strFileName);
    BinaryReader reader = new BinaryReader(fs);
    //reads the first 36 bytes of trash
    reader.ReadBytes(36);
    //reads and stores the record delimiter string (assumed here to be the 16 byte delimiter)
    byte[] byteSignature = reader.ReadBytes(16);
    string strSignature = Encoding.ASCII.GetString(byteSignature);
    //advances the cursor 1 byte
    reader.ReadBytes(1);
    //stores SKS ID
    string strSKSID = Encoding.ASCII.GetString(reader.ReadBytes(16));
    //advances the cursor 1 byte

etc... reading down until I get to the image field.

Luc Pattyn wrote:

If you can't follow the above scheme (e.g. because the file format has been fixed and does not include a type byte), then you need to determine the type of the next record by reading and analyzing some bytes, then rewind a bit (using the Seek method or the Position property) and Load a record; repeat until done.

I'm not sure how to actually implement your suggestion of reading and analyzing some bytes, then rewinding. Could you provide some example code? Thanks again!

Luc Pattyn

Hi, since you don't control the file format, here are the fundamentals you will need, plus some suggestions:

- use one FileStream for your file
- use BinaryReader.ReadBytes() to read a number of bytes at the current position (it will advance the current position); the problem here is that you must specify the byte count
- create a number of classes or structs, one for each possible record type.
- if class/struct RecordType1 is one of the possible record types, you should give it two static methods:
  - bool Accept(FileStream) would read some bytes and decide whether or not the data fits the record type for that class/struct; it should restore the filestream position as if nothing happened (use the FileStream.Position property to remember where you are in the file, and to return to that position); it should not throw exceptions to the caller.
  - RecordType1 Load(FileStream) would read all the bytes needed to load a record of that type, knowing it is of that type (since it will have been Accepted beforehand). Load does advance the filestream, so it consumes the record and returns the result. It should throw exceptions when something fails.
- details on Accept: you can try and recognize the first few bytes; JPEG always starts with FF D8 and often with FF D8 FF E0; but nothing prevents other (non JPEG) records from also starting with FF D8 !! So your collection of Accept() methods should be sufficiently accurate to discern the record types at hand.
- details on Load: you should know the byte count in order to read the right number of bytes; scanning for an end marker is difficult: even if JPEG always ends on FF D9, that does not mean the first FF D9 is the end of a JPEG (it can also occur earlier in the file, for example at the end of an embedded thumbnail).
- it is rather hard to decode JPEG, so I suggest letting GDI+ try and decode a JPEG image. One way would be to create a memory stream from your byte array, then call Image.FromStream(MemoryStream), but I suspect you could directly call Image.FromStream(FileStream), avoiding the byte count problem completely.
- you can create a new BinaryReader in every Accept and every Load method in each RecordType class/struct, or reuse a single one all over the place (don't try something intermediate).
- also provide a class/struct to handle the end-of-file record; it needs an Accept but does not need a Load() method.
- and now the finale: put all your Accept and Load methods in one loop to decode the entire file, as in:

try {
FileStream fs=...
for (
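
The Accept/Load pattern described in this post might look roughly like the following. This is a sketch under stated assumptions, not the original code: JpegRecord and RecordDecoder are hypothetical names, the methods take a Stream rather than a FileStream so they can be tried on a MemoryStream, Load takes an explicit byte count because the real record layout is not known here, and reaching the end of the stream stands in for the end-of-file record.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical record type, recognised by the JPEG start bytes FF D8.
static class JpegRecord
{
    // Peeks at the next two bytes and restores the stream position afterwards.
    public static bool Accept(Stream fs)
    {
        long start = fs.Position;
        try
        {
            int b0 = fs.ReadByte();   // ReadByte returns -1 at end of stream
            int b1 = fs.ReadByte();
            return b0 == 0xFF && b1 == 0xD8;
        }
        finally
        {
            fs.Position = start;      // leave the stream as if nothing happened
        }
    }

    // Consumes exactly 'count' bytes; throws if the stream ends early.
    // Assumption: the caller knows the byte count of the record.
    public static byte[] Load(Stream fs, int count)
    {
        byte[] data = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = fs.Read(data, offset, count - offset);
            if (read == 0)
                throw new EndOfStreamException("Stream ended inside a record.");
            offset += read;
        }
        return data;
    }
}

static class RecordDecoder
{
    // Decodes the whole stream; every record is assumed to be recordLength bytes.
    public static List<byte[]> ReadAll(Stream fs, int recordLength)
    {
        var records = new List<byte[]>();
        while (fs.Position < fs.Length)
        {
            if (JpegRecord.Accept(fs))
                records.Add(JpegRecord.Load(fs, recordLength));
            else
                throw new InvalidDataException("Unknown record at offset " + fs.Position + ".");
        }
        return records;
    }
}
```

With more record types, the if/else in ReadAll would grow into a chain of Accept calls, one per type.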

htres

Well, I was finally able to do this in a pretty efficient way. I just read in the entire file, about 256k, to a byte array. Then I could convert it to a string in order to use string.IndexOf to find the record delimiters. I then used those delimiter positions and Array.Copy to copy what I wanted out of the original byte[] to its own byte[]. From there it was easy to get the image because each record has a fixed 223 byte header, so the remainder of the record had to be the embedded image. I just copied what was left of the record after the first 223 bytes to another byte[] and wrote it to disk, named it .jpg, and tada, I had the jpeg image!
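
A sketch of this approach, with one substitution: the delimiter is searched for directly in the byte array instead of going through string.IndexOf. RecordSplitter and its method names are hypothetical, and the sketch assumes each record starts with the delimiter and that the header length (223 in this file) is counted from the start of the delimiter; both assumptions would need checking against the real file.

```csharp
using System;
using System.Collections.Generic;

static class RecordSplitter
{
    // Index of the first occurrence of pattern in data at or after start, or -1.
    public static int IndexOf(byte[] data, byte[] pattern, int start)
    {
        for (int i = start; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Splits the file on the delimiter and returns what follows each header.
    public static List<byte[]> ExtractImages(byte[] file, byte[] delimiter, int headerLength)
    {
        if (delimiter.Length == 0)
            throw new ArgumentException("Delimiter must not be empty.", "delimiter");

        var images = new List<byte[]>();
        int recordStart = IndexOf(file, delimiter, 0);
        while (recordStart >= 0)
        {
            // The record runs up to the next delimiter, or to the end of the file.
            int next = IndexOf(file, delimiter, recordStart + delimiter.Length);
            int recordEnd = next >= 0 ? next : file.Length;

            int imageStart = recordStart + headerLength;
            if (imageStart < recordEnd)
            {
                byte[] image = new byte[recordEnd - imageStart];
                Array.Copy(file, imageStart, image, 0, image.Length);
                images.Add(image);
            }
            recordStart = next;
        }
        return images;
    }
}
```

The input could come from File.ReadAllBytes and each returned array could be saved with File.WriteAllBytes under a .jpg name.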

Luc Pattyn

Hi, I'm glad you got something working. I would not fully trust the string.IndexOf part, since string operations perform unpredictably on non-string data (such as JPEG images, which can contain any bit pattern, that could be misinterpreted as Unicode characters).
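
To illustrate the concern, here is a small sketch (EncodingPitfall is a hypothetical helper, not part of any library). Encoding.ASCII replaces every byte above 0x7F with '?', so different byte sequences can decode to the same string, whereas ISO-8859-1 maps each byte to exactly one character and back. If strings are used for searching anyway, ISO-8859-1 together with StringComparison.Ordinal is probably the safer combination, since the default string.IndexOf(string) is culture-sensitive.

```csharp
using System;
using System.Linq;
using System.Text;

static class EncodingPitfall
{
    // True when two byte arrays decode to the same string under ASCII.
    public static bool LookTheSameAsAscii(byte[] a, byte[] b)
    {
        return Encoding.ASCII.GetString(a) == Encoding.ASCII.GetString(b);
    }

    // True when decoding as ISO-8859-1 and encoding again gives back the same bytes.
    public static bool SurvivesLatin1RoundTrip(byte[] data)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] back = latin1.GetBytes(latin1.GetString(data));
        return data.SequenceEqual(back);
    }
}
```

For example, LookTheSameAsAscii applied to the bytes FF D8 and 80 90 should report a match even though the bytes differ, which is the kind of false positive a byte-level search avoids.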

htres wrote:

each record has a fixed 223 byte header

That's new info, makes things easier I guess.

htres wrote:

wrote it to disk, named it .jpg, and tada, I had the jpeg image

As I mentioned earlier, if you only want the image, I guess you can do it without such a file by using Image.FromStream(); if you need the file, then this is the way to go. :)

htres

Luc Pattyn wrote:

I would not fully trust the string.IndexOf part, since string operations perform unpredictably on non-string data (such as JPEG images, which can contain any bit pattern, that could be misinterpreted as Unicode characters).

You are right, I had a lot of trouble with string.IndexOf when I was trying to isolate just the jpeg by searching for small strings of 2-4 chars. But after a lot of testing using it to find the record delimiter, which is the same 16 byte string in every record, it works very reliably. Even though it is entirely possible for this particular 16 byte string to show up within the jpeg encoding, the odds are against it.

Luc Pattyn wrote:

htres wrote: "each record has a fixed 223 byte header". That's new info, makes things easier I guess.

Yeah it was a lot easier. That 223 byte header contained the fixed length fields that held the information about the image. I probably should have posted an example of the file format...but it would have been ugly since it is mostly binary.

Luc Pattyn wrote:

htres wrote: "wrote it to disk, named it .jpg, and tada, I had the jpeg image". As I mentioned earlier, if you only want the image, I guess you can do it without such a file by using Image.FromStream(); if you need the file, then this is the way to go.

I haven't tried the Image.FromStream option yet, though I plan to eventually. Ultimately I'd like to populate a database with the picture and header data as well. But I'll leave that part for a new thread... ;) Thanks for the help!