Binary Data Storage

Forum: Design and Architecture
Tags: database, cryptography, question

rcardare (#1) wrote:

I need some ideas or pointers on storing several hundred gigabytes of data. The files transferred currently range from 1 KB up to 600 MB and vary widely in size. What I would like to do is break the files down into small (8 KB?) blocks, index each block by a hash, and then record a chain in a database so the files can be reconstructed. I like this because duplicated blocks of data can be identified, which reduces the size of the storage. Bad idea? How might a directory structure look for this, so that it isn't impossible to enumerate the blocks? I thought about storing the file blocks in a database as BLOBs, but I think that would put too much strain on the database. SQL Server 2008 will have some nice features for this, but it will be a year or two before we get there. Any ideas? Thanks in advance.
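
For illustration only, here is a minimal sketch of that block/hash/chain idea on a plain filesystem. The 8 KB block size comes from the question above; the SHA-256 hash, the two-level hash-prefix directory layout, and the JSON manifest are assumptions of mine, not anything decided in this thread, and in practice the chain would more likely live in a database table of (file_id, sequence, block_hash) rows.

```python
import hashlib
import json
from pathlib import Path

BLOCK_SIZE = 8 * 1024  # 8 KB blocks, as floated in the question

def store_file(src: Path, block_root: Path, manifest_path: Path) -> None:
    """Split src into fixed-size blocks, store each unique block once
    (named by its SHA-256 hash), and record the ordered hash chain."""
    chain = []
    with src.open("rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            # Fan out by hash prefix so no single directory ends up with
            # millions of entries and the store stays easy to enumerate.
            block_path = block_root / digest[:2] / digest[2:4] / digest
            if not block_path.exists():  # duplicate blocks are stored only once
                block_path.parent.mkdir(parents=True, exist_ok=True)
                block_path.write_bytes(block)
            chain.append(digest)
    # The chain could just as well be rows of (file_id, sequence, block_hash).
    manifest_path.write_text(json.dumps({"file": src.name, "blocks": chain}))

def rebuild_file(manifest_path: Path, block_root: Path, dest: Path) -> None:
    """Reassemble a file by concatenating its blocks in chain order."""
    manifest = json.loads(manifest_path.read_text())
    with dest.open("wb") as out:
        for digest in manifest["blocks"]:
            out.write((block_root / digest[:2] / digest[2:4] / digest).read_bytes())
```

Whether the savings are worth it depends on how much genuinely duplicated data fixed 8 KB blocks actually catch; fixed-size blocks miss duplicates that are shifted by even one byte.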

Mark Churchill (#2), in reply to rcardare, wrote:

What's wrong with the filesystem for storing your data, with a separate index if necessary? We probably need more information on what type of data you have and how you intend to index it. When you say "I have a few hundred gigabytes of files and I need to store them somehow", the first thing that comes to mind is NTFS ;)

Mark Churchill, Director, Dunn & Churchill

rcardare (#3), in reply to Mark Churchill, wrote:

Thanks for the response, Mark. The system I am working on is a pub/sub that distributes files to multiple subscribers. It is an in-house system that transfers manufacturing data: applications, documents, collected data, test results, etc. Currently the data is uploaded to a file server in the sky and the subscribers then download it; nothing too complicated.

It is architected so that subscribers connect through a load-balanced web farm and download small 64 KB chunks until the transfer is complete. Each chunk is a new HTTP request, which is causing heavy I/O between the web servers and the file server: for every request the file is opened and read up to the position of the requested chunk before the data is returned. It would be ideal to just stream the file over the same connection until the transfer is complete and resume after network hiccups. The problem is that some of the third-party sites use older proxy servers which won't allow that, and there is a firewall (beyond my control) which limits how long a connection is allowed to remain open.

My new plan is to store the files in smaller chunks so downloading is more efficient, since the server would not have to navigate through a large file to the position of the chunk being downloaded. I could also leverage this to reduce duplication of data stored on the server. Unfortunately, with this design I would then have to open a database connection to identify where the blocks of data are stored and how to piece them together, and I would probably end up in a worse scenario with the I/O to the database and the work of calculating what to return. This is where I thought storing the data in the database might be better: I would already have a connection, and I could query for and return the exact requested chunk. I am torn because of the cost of storing that much data in a database. I am just curious whether there are any ideas. I probably should not worry about how the data is stored and should instead work on reducing the number of HTTP requests. Thanks.
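
For reference, a minimal sketch of the chunk-at-a-time download done with HTTP Range requests, so each request seeks straight to the needed offset instead of reading the file from the start. The 64 KB chunk size matches the current design; the use of urllib and the resume-by-file-size logic are illustrative assumptions, and it presumes the servers and proxies honor Range requests.

```python
import os
import urllib.error
import urllib.request

CHUNK_SIZE = 64 * 1024  # 64 KB per request, matching the current design

def download_in_chunks(url: str, dest: str) -> None:
    """Fetch a file one Range request at a time; rerunning the function
    resumes from however many bytes dest already contains."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    with open(dest, "ab") as out:
        while True:
            req = urllib.request.Request(url)
            # Assumes the server answers Range requests with 206 Partial Content.
            req.add_header("Range", f"bytes={offset}-{offset + CHUNK_SIZE - 1}")
            try:
                with urllib.request.urlopen(req) as resp:
                    data = resp.read()
            except urllib.error.HTTPError as e:
                if e.code == 416:  # requested range starts past end of file: done
                    break
                raise
            if not data:
                break
            out.write(data)
            offset += len(data)
```

Keeping each request small like this stays within the proxy and firewall limits described above, at the cost of one round trip per chunk.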

led mike (#4), in reply to rcardare, wrote:

rcardare wrote:

    I probably should not worry about how the data is stored and should instead work on reducing the number of HTTP requests.

Yes, it sounds to me like your current solution is not using chunked transfer encoding, or not using it correctly.
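
For illustration, here is roughly what chunked transfer encoding looks like on the wire when a server streams a whole file over a single response. Python's standard http.server is used only to keep the sketch self-contained, and the file name and port are made up; whether the older proxies mentioned above pass this through is exactly the open question.

```python
import http.server
from pathlib import Path

FILE_TO_SERVE = Path("payload.bin")  # hypothetical file name

class ChunkedHandler(http.server.BaseHTTPRequestHandler):
    """Streams one file using HTTP/1.1 chunked transfer encoding, so the
    whole transfer happens over a single request/response."""
    protocol_version = "HTTP/1.1"  # chunked encoding requires HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Transfer-Encoding", "chunked")
        self.send_header("Content-Type", "application/octet-stream")
        self.end_headers()
        with FILE_TO_SERVE.open("rb") as f:
            while True:
                block = f.read(64 * 1024)
                if not block:
                    break
                # Each chunk is framed as: <hex length>\r\n<data>\r\n
                self.wfile.write(f"{len(block):X}\r\n".encode())
                self.wfile.write(block)
                self.wfile.write(b"\r\n")
        self.wfile.write(b"0\r\n\r\n")  # zero-length chunk ends the body

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), ChunkedHandler).serve_forever()
```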

Mark Churchill (#5), in reply to rcardare, wrote:

Hi. It seems the issue with some legacy clients not handling large downloads may be solvable by having the clients request the data in chunks. That doesn't mean the data needs to be stored in chunks though ;) I'd solve the issues with how the clients retrieve the files first; then you can see how appropriate your storage mechanism is. (I'd suggest NTFS with additional indexing in SQL Server would work fine.)

Mark Churchill, Director, Dunn & Churchill
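
A rough sketch of the "store whole files, serve them in chunks" idea on the server side. The dictionary is only a stand-in for the kind of index table that could live in SQL Server, and every name and path here is made up:

```python
import os

# Hypothetical stand-in for an index kept in SQL Server:
# file id -> path of the whole file on an NTFS share.
FILE_INDEX = {"doc-1234": r"\\fileserver\pub\doc-1234.dat"}

def read_chunk(file_id: str, offset: int, length: int = 64 * 1024) -> bytes:
    """Return one chunk of a whole file stored on the filesystem.
    seek() jumps straight to the offset, so the server never has to
    read through the earlier bytes of a large file."""
    with open(FILE_INDEX[file_id], "rb") as f:
        f.seek(offset)
        return f.read(length)

def file_size(file_id: str) -> int:
    """Lets a client work out when to stop asking for more chunks."""
    return os.path.getsize(FILE_INDEX[file_id])
```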

etkid84 (#6), in reply to rcardare, wrote:

Here is my two cents, for what it's worth: avoid moving the data around. Why not use the database to tell the client where the data is located? Wherever there is data, there should be services that provide the client with the "knowledge" about the data it is looking for. When you are working with large chunks of data, avoid moving it around, especially across a network. Make the local service perform as much work as possible and return an "answer".

              David

ky_rerun (#7), in reply to etkid84, wrote:

We had a similar problem with large file distribution. I wrote a system where a client would download the file and then send broadcast messages out so that other local hosts could pick up the file and download it from the LAN. Also, if you're using Windows clients, look at the Background Intelligent Transfer Service (BITS).


a programmer trapped in a thug's body
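
A minimal sketch of just the LAN announcement part of that scheme. The port number and the "HAVE <file id>" message format are invented for illustration, and the actual peer-to-peer transfer and any BITS integration are left out:

```python
import socket

ANNOUNCE_PORT = 50007  # made-up port for the LAN announcements

def announce_file(file_id: str) -> None:
    """After finishing a download, tell other hosts on the local subnet
    that this machine can now serve the file."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(f"HAVE {file_id}".encode(), ("255.255.255.255", ANNOUNCE_PORT))

def listen_for_announcements() -> None:
    """Peers listen and note which hosts already hold which files, so
    they can fetch from the LAN instead of the central server."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("", ANNOUNCE_PORT))
        while True:
            msg, (peer_ip, _port) = s.recvfrom(1024)
            if msg.startswith(b"HAVE "):
                print(f"{peer_ip} has {msg[5:].decode()}")
```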
