Efficient way to read/write file
-
I have an MFC application to which I am adding a new feature: writing a structure of around 900 KB to a file on disk every second. Another application of mine will then read the same file every second, after some delay. My questions are:
1. How can I save the structure (900 KB) to the file in a reduced-size format, so that the file does not become too big?
2. Should I save each second's structure as an individual file (60 files for 60 seconds), or append the structures sequentially to a single file? Which is more efficient for reading and writing?
Please note that I am going to write the structure to the file for 4 days at most.
Unless you have a lot of software experience, consider posting your specification (i.e. the actual problem that you're trying to solve). It might be that there are better approaches than having to write, and later read, a 900KB file every second.
Robust Services Core | Software Techniques for Lemmings | Articles
-
manoharbalu wrote:
Please note that I am going to write the structure to the file for 4 days as a maximum period.
4 days is 345,600 seconds. Are you saying that you are going to write 311 gigabytes of data to disk in 4 days? (That is, if you don't find a good way to compress it.) I certainly would not want to manage a single 311 GByte file. Nor would I want to manage 345,600 files ...

Writing a megabyte per second to a modern disk is nothing to worry about. A modern processor could easily handle compression of a megabyte a second, too. A disk that holds a third of a terabyte is no problem. You don't need any super-technology to get this to work. I would be more worried about handling either a 300 GByte file or a third of a million tiny files (yes, by modern standards, files < 1 MByte are "tiny"). Compressing to save space is just a technical detail - at the application level, you relate to the data at its uncompressed size. That comes naturally if the data is, say, a video stream, but then you would never consider spreading it over a third of a million files. So I am really curious to hear about the nature of the data!

I am not sharing CPallini's scepticism about general compression algorithms. They work by recognizing recurring patterns and representing them with shorter codewords. In totally unprocessed data there is often a large share of repeated patterns, especially in logging data - and in all sorts of readable text. Either you will get quite far using a general method (such as pkzip), or it takes a lot of detailed knowledge of the information structures to do much better. Obviously, if your data is a known media type, such as video, photo or sound, then the methods have already been developed and you can get hold of a library to do it - I am thinking of the case where you handle your own data types for which no dedicated compression method is yet known.

Try out general methods first. Only if the results are completely unsatisfactory - the files still way too large - should you consider making something of your own. Maybe you can reduce the data volume before you send it to compression: factoring out information elements, omitting data that can be reconstructed from other data, etc. Also consider whether you need lossless compression. If these 900 kB chunks are individual log entries, will they ever later be addressed as individual entries? Maybe you can reduce the data to statistics - averages, sums, whatever. (Like, when you swipe your ...
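A minimal sketch of the "try a general method first" advice, assuming zlib is available and the 900 KB record can be treated as a flat byte buffer; the deflateRecord helper and the dummy payload are illustrative only, not part of the original application:

// Runs one 900 KB record through zlib's deflate (the same general-purpose
// compression family that pkzip uses) and reports the compressed size.
#include <zlib.h>
#include <cstdio>
#include <vector>

std::vector<unsigned char> deflateRecord(const unsigned char* data, uLong size)
{
    uLongf packedSize = compressBound(size);        // worst-case output size
    std::vector<unsigned char> out(packedSize);
    if (compress(out.data(), &packedSize, data, size) != Z_OK)
        out.clear();                                // caller decides how to handle failure
    else
        out.resize(packedSize);                     // shrink to the real compressed size
    return out;
}

int main()
{
    std::vector<unsigned char> record(900 * 1024, 0x42);   // stand-in for the real structure
    std::vector<unsigned char> packed =
        deflateRecord(record.data(), static_cast<uLong>(record.size()));
    std::printf("900 KB -> %zu bytes after deflate\n", packed.size());
    return 0;
}

How well this does on the real structure depends entirely on how repetitive the data is, which is exactly why the nature of the data matters.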
-
Putting aside the data storage constraints mentioned by others, you need to experiment.
1) Is the structure bitwise copyable?
2) If so, just write it to disk using CFile and measure the impact.
2a) Use LZ4 to compress the data first. (My own guess is that this would make a difference with a hard drive, but may actually take more time with an SSD.)
3) If not, you will need to serialize it; I'd still keep all the data binary.
3a) You could serialize it to a block of memory and then write that in one go (very likely the fastest way).
3b) You could serialize to a file using CStdioFile (depending on the data, this could be very slow; even slower if you used C++ iostreams).
3c) Do a combination: serialize the data in 32k chunks and write them, perhaps asynchronously. (That said, when doing anything comparable, I prefer just having a second thread write synchronously.)
If the data needs to be future proofed, consider serializing using RapidJSON (which is a very fast C++ JSON library), compressing the result with LZ4 and then writing that. However, this could easily take longer than a second, depending on what the data is.
Edit: If the data is fairly regular, you might be able to save the full thing every ten seconds and differences in between. This can be tricky; I once worked on an app which did this and I found that the differencing alone exceeded the time it took to transmit the data over TCP.
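A minimal sketch of options 2 and 2a, assuming the structure is bitwise copyable and the LZ4 library is linked; "Telemetry", "WriteRaw" and "WriteLz4" are hypothetical names, not from the original post:

// Option 2 writes the raw structure with CFile; option 2a compresses it with
// LZ4 first and stores the compressed size ahead of the payload.
#include <afx.h>      // CFile, CString (MFC)
#include <lz4.h>
#include <vector>

struct Telemetry { char payload[900 * 1024]; };   // placeholder for the real 900 KB structure

void WriteRaw(const Telemetry& t, const CString& path)
{
    CFile f(path, CFile::modeCreate | CFile::modeWrite);
    f.Write(&t, static_cast<UINT>(sizeof t));     // option 2: plain bitwise dump
}

void WriteLz4(const Telemetry& t, const CString& path)
{
    std::vector<char> buf(LZ4_compressBound(static_cast<int>(sizeof t)));
    const int packed = LZ4_compress_default(reinterpret_cast<const char*>(&t),
                                            buf.data(),
                                            static_cast<int>(sizeof t),
                                            static_cast<int>(buf.size()));
    // A real version would check packed > 0 before writing.
    CFile f(path, CFile::modeCreate | CFile::modeWrite);
    f.Write(&packed, static_cast<UINT>(sizeof packed));  // store the compressed size first
    f.Write(buf.data(), static_cast<UINT>(packed));      // option 2a: LZ4-compressed dump
}

Timing both functions with the real data on the target drive would show whether the compression pays for itself.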
-
Joe Woodbury wrote:
If the data needs to be future proofed, consider serializing using RapidJSON (which is a very fast C++ JSON library), compressing the result with LZ4 and then writing that
Guaranteed to be a viable format forever, or five years, whichever comes first. During my career, I have seen shockingly many "future proof" formats come and go. I have come to adopt the attitude I met when working on a large project with digital libraries: Do not go for one standardized format, but let each library choose its own. When you, thirty years from now, need to recover some document, hopefully one of the different formats used is still recognized.

Don't forget that when you pick up that file thirty years from now, that ST506 disk needs a machine with the proper interface. And you need software to interpret the disk layout and directory structures of the file system. You may need to understand all the metadata associated with the data stream. The record structure in the file. The encoding of numerics and characters. The data schema. The semantics of the data.

Sure, JSON is one of the fourteen everlasting syntax standards for how to represent a hierarchical structure. Ten years ago, it wasn't - it didn't exist. Some of the formats used ten years ago are dead now; maybe JSON will be dead in ten years.

Bottom line: never choose a data representation because it will last forever - or even more than five years. If you have a need for that, make a plan for regularly moving your data to a new disk (or other physical storage) format. Move it to a new machine with the proper interface for the physical unit. Move it to a new file system. A new character (or other low-level) encoding. A new schema. A new concrete grammar. Be prepared for some information loss during each move. While having format n as the primary one, always preserve format n-1 (along with all hardware and software required to access it) until you switch to format n+1 - i.e. always have data available in two operational formats. Preferably generate format n+1 from format n-1, to avoid the losses from n-1 to n and further from n to n+1.

But first of all: don't trust DCT100 or Travan to be a format that lasts forever. Nor eSATA. Nor SGML. Nor EBCDIC. HFS. BER. YAML. ... For five years, it may be safe. Anything significantly beyond that is gambling.
-
JSON will be around for a while. At the very least, it's very readable and a step up from plain text or csv. (And to be pedantic, JSON has been around for almost 20 years; XML has been around 24 years and its predecessor, SGML, for 34 years.)

Your rant is borderline senseless and unproductive. I use the slang "future proof" to mean it will last as long as the program, and you know that; you are arguing for argument's sake and preening while doing so. Moreover, your statement "Do not go for one standardized format, but let each library choose its own." is meaningless in this context--short-lived files--and in the broader sense, since it traps you back where you started, afraid to do anything lest it become obsolete. You are also mixing hardware protocols with file formats. Even obscure formats, such as BSON, would be readable in a lossless way fifty years from now, as would a BMP. A plain text file of JSON or YAML is even more readable. JSON is one step above key/value pairs--how would you lose information?

And, if moving from one disk to another, what does "proper interface for the physical unit" have to do with anything? It's a file. I have 30-year-old text files; should I be panicking? Perhaps I should have kept them on floppy disks and kept a floppy disk reader and format n-1 (whatever that is for a text file).
-
I am happy to see that you have rock-solid confidence in today's high fashions. Keep it up! We need enthusiasts.

Nevertheless: when you work on digital library projects with the aim of preserving information (as contrasted with "data" and "files" - the semantic contents!) for at least one hundred years, maybe five hundred, then things come in a different light. To illustrate the problems, I used to show the audience an XML file where everything was tagged in Northern Sami. It made little sense to anyone (except that Sami guy who had helped me make the file). So why couldn't the entire world speak English? My next example was one where a 'p' tag identified a 'person', a 'part' or a 'paragraph', depending on context. It makes little difference whether those are XML or JSON tags if they make no sense whatsoever to the reader, or are highly ambiguous unless you have extensive context information.

Of course you can lose information! Say that you want to preserve a document where page formatting is essential (it could be for legal reasons, or other): for this digital library project, it didn't take me long to dig up seven different strategies in use in various text processing systems for how to handle space between paragraphs in connection with page breaks. If you can "do an XKCD 927" and replace the 14 existing standards with a 15th that replaces them all, then good luck! Many have tried, none have succeeded. When you select a format, whether JSON, MS-Word, HTML, PDF or any other, for your document storage, and convert documents in other formats to the chosen one, you will lose information.

I could show you a pile of different physical media with my old files, totally inaccessible today. If you want to preserve data, you cannot simply say "forget about physical formats, interfaces, media - as long as we use JSON it is safe". No, it isn't. The Travan tape reader uses its own interface card, for the ISA bus. I no longer have a PC with an ISA bus. I've got one SCSI disk with a 50-pin connector; it is not the Centronics 50-pin but a "D" connector with three rows of pins. I once had a 50-pin D-to-Centronics cable, but even if I had saved it, I have no SCSI interface where it fits. I have got the full source code of the OS I was using, on an 8-inch hard disk cartridge, but this millennium I haven't seen a reader for it. I still keep a Win98 PC alive: if I plug the USB floppy disk reader into a modern (XP or later) PC, it won't read floppies without a proper format code in the boot sector. Win98 could. Sometime ...
-
The original poster isn't preserving information; he is saving it temporarily. This is all about temporary data. Right now you are just repeating the obvious.

You also shifted your argument; you went from you will lose information to you can lose information. Then you introduce both complex documents and hardware into your argument. Nobody disagrees about the transience of hardware standards and physical media, so why preach on it? What does it have to do with a text or csv file I have from 1993? I have thirty-year-old source that was on floppy, tape backup, Zip drives, Jazz drives, various hard drives, CD-ROM, DVD-ROM and is now on an SSD and on OneDrive. It still compiles. Yet you argued that I was all but guaranteed to lose information in each transfer; that no file format lasts. It has. (Granted, I'm the only one who cares about that specific project, so when I die or get tired of keeping it around, it will vanish, but that's also true of almost everything I own--that's life.)

I'm also a bit baffled by the claim that "ASCII was introduced as the final solution for all computers all over the world to interchange arbitrary data". No, it wasn't. Your straw man collection is now complete! Or is it? :)
-
I'm not sure where "member" is coming from. I get the point, but I'm not sure it's meaningful. This all boils down to requirements, and nothing has been offered by the OP except for "I need to write it fast." There is nothing to do with hardware here - it might be another requirement. As for the file format - document it and you're done. This is not rocket science. Custom file formats are all over the place, and as long as they are documented, you can always write a filter to change their format.

OP, I recommend that you rig up a sample application that dumps dummy data at the frequency you need. See if it works. Fool around with it a bit. This is what you originally posted:
Quote:
1. How can I save the structure (900 KB) to the file in a reduced-size format, so that the file does not become too big?
2. Should I save each second's structure as an individual file (60 files for 60 seconds), or append the structures sequentially to a single file? Which is more efficient for reading and writing?
Please note that I am going to write the structure to the file for 4 days at most.
Item 1: why do you care about file size? Are you on an embedded system or otherwise restricted in resources? Does this project limit your disk size? If not, size the drive accordingly, write the data and move on.
Item 2: see my note above - *try* something. You can easily rig up code to test this (see the sketch below).
fwiw, you've not provided sufficient requirements for us to help you. You are going to have to worry about mutual file access, so you'll need to be thoughtful, but I don't see a performance concern at all.
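A minimal sketch of such a test rig, assuming plain C++, a dummy 900 KB payload, and a single append-only file; the file name, one-minute duration and payload contents are illustrative assumptions:

// Writes a dummy 900 KB record once per second and prints how long each write takes.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <thread>
#include <vector>

int main()
{
    const std::vector<char> record(900 * 1024, 'x');                 // dummy 900 KB payload
    std::ofstream out("dump.bin", std::ios::binary | std::ios::app);

    for (int second = 0; second < 60; ++second)                      // one-minute trial run
    {
        const auto start = std::chrono::steady_clock::now();
        out.write(record.data(), static_cast<std::streamsize>(record.size()));
        out.flush();
        const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start);
        std::printf("write %d took %lld ms\n", second,
                    static_cast<long long>(elapsed.count()));

        if (elapsed < std::chrono::seconds(1))                       // keep a 1 Hz cadence
            std::this_thread::sleep_for(std::chrono::seconds(1) - elapsed);
    }
    return 0;
}

Comparing this against a variant that opens a new file per record would answer question 2 directly on the target hardware.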
Charlie Gilley
Stuck in a dysfunctional matrix from which I must escape... "Where liberty dwells, there is my country." B. Franklin, 1783 “They who can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.” BF, 1759
-
charlieg wrote:
This all boils down to requirements, and nothing has been offered by the OP except for "I need to write it fast." There is nothing to do with hardware here - it might be another requirement.
What spun off this sub-thread of the discussion was
Joe Woodbury wrote:
If the data needs to be future proofed, consider serializing using RapidJSON (which is a very fast C++ JSON library), compressing the result with LZ4 and then writing that.
"Future proof" was not stated as a requirement, but now that Joe Woodbury presented what is - in my eyes - a rather naive approach to future proofing, I chose to point out that if you want future proofing, it takes a lot more than just using a basic structure encoding that is currently fashonable. It seems quite obvious that Joe Woodbury has never been working in the area of long time information preservation. I have a few years of experience. I know that it is not a trivial issue. When someone makes a statement that suggests "Just use JSON and LZ4, and the information is safe for the future", I think that this is so naive that it crosses the border to "fake news", and I want to correct it. However, Joe Woodbury is not willing to accept anything that can affect the validity of his claim, calling my comments a "rant", "senseless and unproductive", that I am "mixing up" things by pointing to other important elements, that I am "just repeating the obvious". I wrote "Be preparered for some information loss during each move". When Joe Woodbury stated "JSON is one step above key/value pairs--how would you lose information?", I went on to provide examples. Then he comes back with "You also shifted your argument; you went from you will lose information to you can lose information", and concludes "Your straw man collection is now complete!" I don't think anything valuable will come out of a further discussion with Joe Woodbury. So I let it rest. What regards "just document it": I have seen guides seriously suggesting a URL to visit if you have problems connecting to the Internet. I have seen document format descriptions stored electronically in the format that is described. I have seen format "documentation" that is hopelessly inadequate - having worked with the format for a long time gives the documentation some value, but often you need access to the format designer to have him explain it. I have format descriptions on 5.25
-
I catch what you are saying; it just seemed to me that, in the context of the original question, the discussion veered. No harm or foul.
Charlie Gilley
Stuck in a dysfunctional matrix from which I must escape... "Where liberty dwells, there is my country." B. Franklin, 1783 “They who can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.” BF, 1759