File I/O: What is the best approach
-
I am currently involved in a project that requires random access to multiple files on disk. I have a single 'file writer' object that handles writes, but I am unsure how to proceed with reads. The question is: if I want to be able to service multiple 'reads' to a single file at the same time, should I have a single object (an fstream) that is synchronized (using a lock), or multiple independent fstream objects? I want to take advantage of my RAID hardware as well as multiple processors throughout the application. My initial thought is that having multiple 'reader' objects leaves synchronization up to the OS, and that using some type of locking mechanism (such as a critical section/mutex) could slow performance. Any help here is much appreciated. On a side note: if I'm just writing buffers of data (or 1-byte-aligned structures), does it make more sense to just use stdio functions?
-
KenThompson wrote:
My initial thought is that having multiple 'reader' objects leaves synchronization up to the OS
What OS, and where is it documented that it performs such synchronization? And by this I assume you mean synchronization with the "write operations".
KenThompson wrote:
and that using some type of locking mechanism (such as a critical section/mutex) could slow performance.
"would" slow performance, it's not in question. However without the OS performing synchronization, of which I am unaware, you would have to do it to avoid reading corrupted data. You could optimize by creating a far more complex mechanism like Databases do of managing what "parts" of the file are locked for writing. Which of course raises the question of why you wouldn't just just use a database because they already implement everything you require. Also you are not clear if this is across threads or across processes. The later synchronization is far more expensive.
-
I'm already synchronizing write and read operations. By this I mean that I keep track of what is currently being done to the file. Basically, the writer never goes backwards, so whatever has been written is fair game for reading. The only random access is reading. A database for this application isn't acceptable. The question remains, though: what approach is best? Having a single reader per file handling many requests (i.e. seekg to the offset), or having several fstream objects that read independently, in a shared mode? I didn't mean that the OS, in this case Windows, prevents corruption when modifying files. I should have been more clear in my statement. I meant to say: my initial thought is that having multiple 'reader' objects is perfectly acceptable and not a performance hit. In addition, in a RAID situation, would it not make more sense to create multiple file streams to the same file, given the very nature of multiple disk heads? I'm not all that aware of where there is any performance to gain based on implementation. I can only assume that if I issue two reads to the same file, via two streams, the RAID controller (in my case RAID 5) would outperform a seekg operation. Maybe not with 2 reads, but maybe with hundreds of reads per second. Does this make sense?
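For contrast, a sketch of the single-stream alternative: one shared fstream where every request must hold a lock across the seek-and-read pair. All names here are illustrative, not from the project being discussed.

```cpp
#include <fstream>
#include <mutex>
#include <string>

// One shared stream: the seekg and the read must happen under the same
// lock, otherwise a concurrent request could move the get pointer
// between them and return the wrong bytes.
class SharedReader {
public:
    explicit SharedReader(const std::string& path)
        : in_(path, std::ios::binary) {}

    std::string read_at(std::streamoff offset, std::size_t len) {
        std::lock_guard<std::mutex> lock(m_);   // serializes every request
        in_.seekg(offset);
        std::string buf(len, '\0');
        in_.read(&buf[0], static_cast<std::streamsize>(len));
        return buf;
    }

private:
    std::mutex m_;
    std::ifstream in_;
};
```

The lock means requests are serviced one at a time, so this design cannot keep multiple RAID spindles busy; that is exactly the trade-off in question.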
-
KenThompson wrote:
Does this make sense?
Absolutely. It would seem, on the surface (not a valid assessment), that multiple readers would be much more efficient. You might consider just implementing that and then run a profiler and see if that area even stands out at all. Sometimes (usually) profiler results will surprise you. ;)
-
Thanks, I needed some sanity here. What type of profiler do you use? I've been playing with a few intel products, but the cost sours me to them.
-
IOCP - not just for sockets; they can be used with any IFS-based HANDLE, e.g. sockets, files, pipes, ... Scatter/gather I/O functions: http://msdn2.microsoft.com/en-us/library/aa365472.aspx http://en.wikipedia.org/wiki/Vectored_I/O
...cmk The idea that I can be presented with a problem, set out to logically solve it with the tools at hand, and wind up with a program that could not be legally used because someone else followed the same logical steps some years ago and filed for a patent on it is horrifying. - John Carmack
-
KenThompson wrote:
A database for this application isn't acceptable.
So it's better to write half of a database on your own instead of using e.g. SQLite?
KenThompson wrote:
Does this make sense?
I think not. You would only ever work on buffers managed by the OS. The hard-disk heads would ascend and descend the cylinders all the time, doing the reads/writes out of order.
Though I speak with the tongues of men and of angels, and have not money, I am become as a sounding brass, or a tinkling cymbal.
George Orwell, "Keep the Aspidistra Flying", Opening words
-
SQLite is designed for use with databases sized in kilobytes or megabytes, not gigabytes. Therefore, it is unacceptable.
-
Thank you, this suggestion has proved quite fruitful! Ken