Code efficiency
-
Hi all, I need to read data from a binary file whose structure looks like this: record size, record, record size, record... Reading the size, then the record, and so on is not efficient. Which of these approaches is more efficient? 1) Reading a block of data into a fixed-size buffer; whenever a record is cut off at the end of the buffer, I use fseek to move the stream's position indicator back. Example: I have a 16K buffer and each record is around 2.5K, so I process about 6 records and then call fseek(fp, total_processed_bytes, SEEK_SET); 2) Doing the same with two buffers, but without the fseek calls. 3) Some other idea? Thanks.
-
Evgeni57 wrote:
Which of the ways more efficient: 1)Reading to fixed size buffer block of data... 2)Doing the same with two buffers...
Neither. I don't think you will get better performance by reading into your own buffer (although you might want to read up on the Stream I/O[^] functions you are using). Your calls to fread()/fseek() (or whatever you are using) are already buffered. So, not only does the library call read data from the disk into its internal buffer, it then copies it out of that internal buffer into the buffer you specify. The reading sequence you described is always linear, forward reading (not a whole bunch of random access). Creating and managing an explicit, home-brew buffering system is actually likely to be less efficient, more error-prone, and take longer to code and test than just reading the information you need and moving on. Your first call to fread()
will buffer up a block of memory internally. Subsequent calls will then read from the buffered data (for performance) and fetch new data from the device (when needed). Of course, the best way to proceed is to code it in the simplest manner to get it working; test it for accuracy; then determine whether you need better performance and, only then, look for alternatives. Enjoy, Robert C. Cartaino
-
Why do you need to do any seeking/rewinding? You should be able to read that data entirely sequentially. If you're adding your own buffering on top of the buffering already being done, that's probably just making the process less efficient. You may be able to find a buffer size better suited to your data than the default 4K, but beyond that I would think simply reading size/record pairs would be the most efficient. Mark
Mark Salsbery Microsoft MVP - Visual C++ :java:
-
Hi, thanks for your reply.
Why do you need to do any seeking/rewinding?
Because the records are not fixed in size: one record may be 60 bytes long, another 2847 bytes. If I always read 8192 bytes (the cluster size of the disk I'm working with), then some records will be cut off at the end of a read, so seeking/rewinding or using my own buffering is essential. That's how it looks to me.
I would think just reading record size and record pairs would be most efficient.
That will increase the number of disk accesses, so I don't think it will be more efficient.
-
Evgeni57 wrote:
It will increase the number of the disk's accesses so I don't think it will be more efficient.
No, it won't. If you're using the fopen/fread/fseek family of I/O functions, then data is being read from disk in 4K (by default) chunks.
Evgeni57 wrote:
If I read any time 8192 bytes - size of disk's cluster, that I'm working with, then some records will be missed and seeking/rewinding or using my own buffering is essential.
Assuming you do your own buffering (which, as it stands, is an inefficient solution), you would parse records from your buffer until there isn't a complete record left. Then you shift the remaining contents to the beginning of the buffer and read more bytes from disk. No rewinding or seeking backward is required, and to do so would be inefficient - you'd be reading the same data multiple times. Given what you've described as your file spec, this is completely unnecessary anyway. If you intend to do your own buffering, then you should use UNBUFFERED I/O, not fopen/fread/fseek/etc.

You've got a good file format for sequential reading, which is always the fastest. Why are you trying to complicate that? Do some benchmarks - I would bet that simply looping with an fread to read a record length followed by an fread to read the record data will be more efficient than anything you come up with involving random access (seeking backward). You can then tune the default buffer size (see setvbuf()) to something more appropriate for your typical record size. For example, reading 20-byte records with the default 4K buffer is pretty efficient; reading 8K records would do better with a larger buffer, maybe 32K. Mark
Mark Salsbery Microsoft MVP - Visual C++ :java: