Best way to store MASSIVE amounts of data?
-
What is the best way to store *MASSIVE* amounts of data? I'm thinking [EDITED] millions or billions of images totalling 10TB or so. Not sure of the exact size yet... I need to get some data samples... but somewhere in that arena. At a high level, I guess they could be split by state. I know that SQL cannot handle this kind of size. Performant lookup is also desired. How is this usually handled? It needs to be backupable as well, of course. An old boss of mine was a big fan of storing the path in the DB and the files on disk. From personal experience, you end up with so many directories that the file system breaks down. Try navigating to a folder with 1000+ directories. I don't think I'll need very complex queries on the data, just simple lookups. Inserts should be fast as well.
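One common workaround for the "1000+ directories in one folder" problem is to hash each image ID and use the leading hex digits as a two-level directory fan-out, so every directory stays small. A minimal sketch (the root path, ID format, and `.jpg` extension are just illustrative assumptions):

```python
import hashlib
import os

def shard_path(root, image_id):
    """Return a two-level hashed path for an image, so files spread
    across 256 * 256 = 65,536 buckets instead of one giant directory."""
    digest = hashlib.sha1(image_id.encode()).hexdigest()
    # e.g. /data/images/ab/cd/img-0001.jpg
    return os.path.join(root, digest[:2], digest[2:4], image_id + ".jpg")

print(shard_path("/data/images", "img-0001"))
```

The hash is deterministic, so the same ID always maps to the same path, and the DB only needs to store the ID rather than the full path.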
-
That sort of volume is going to require some serious iron. I would suggest you look at Oracle; I loathe Oracle, but SQL Server struggles with serious volume. We were looking at about the same record volume, though with a substantially smaller data size, and I think the three-server cost topped $1m for the Oracle licences alone. Then you are going to want to hire an Oracle consultant/DBA to design and tune the blasted thing.
Never underestimate the power of human stupidity RAH
-
Hmm... you're right haha... my estimates were a bit over the top. I did the math just now, and that's like 95,000 TB... lol... I think 10TB - 100TB total data is more in the right area... oops :)
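For what it's worth, a number like 95,000 TB falls out of very simple arithmetic. With assumed figures (10 billion images at 10 MiB each; neither number is from the thread), the back-of-the-envelope check looks like:

```python
# Back-of-the-envelope capacity estimate (numbers assumed for illustration):
images = 10_000_000_000           # 10 billion images
avg_bytes = 10 * 1024**2          # 10 MiB average per image
total_tb = images * avg_bytes / 1024**4
print(f"{total_tb:,.0f} TB")      # roughly 95,000 TB, i.e. ~95 PB
```

Getting real averages from data samples first, as suggested above, changes this estimate by orders of magnitude.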
-
As I responded to the other guy, I got a little overzealous with my estimates... 10TB to 100TB is probably closer.
-
OK, there are the images, and then there's the metadata: the index fields that tell you how to locate an image. To locate the correct image (or a pointer to it), you need the metadata; that is what you search on. The question is, do you really need to store the images WITH the meta/index data? I'd suggest not. So you keep a searchable DB of metadata/indexes on fast tier-1 storage; then, once the right record is found, you retrieve the image from, say, a WORM drive or tier-2 storage using the pointer/location from the search. Plenty of storage providers offer WORM storage, for example.
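A minimal sketch of that split, using an in-memory SQLite table as a stand-in for whatever tier-1 DB you pick (the schema, the `worm://` pointer scheme, and all values are made up for illustration):

```python
import sqlite3

# Searchable metadata lives in a small, fast DB; the image bytes live
# elsewhere (WORM / tier-2 storage) behind a pointer column.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE image_meta (
    image_id TEXT PRIMARY KEY,
    state    TEXT,
    taken_at TEXT,
    location TEXT    -- pointer into tier-2 storage, not the bytes
)""")
db.execute("CREATE INDEX idx_state ON image_meta(state)")
db.execute("INSERT INTO image_meta VALUES (?, ?, ?, ?)",
           ("img-42", "CA", "2012-01-01", "worm://vol7/ab/cd/img-42.jpg"))

# The search touches only metadata; the pointer says where to fetch from.
row = db.execute("SELECT location FROM image_meta WHERE image_id = ?",
                 ("img-42",)).fetchone()
print(row[0])
```

The key property is that the hot, indexed search path never has to scan or move image bytes.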
-
If the number of records is over 100m, then you are going to struggle with SQL Server even when you go down the path Garth suggested, and that is really your only option anyway. There is no way you want the images anywhere near your searchable data.
Never underestimate the power of human stupidity RAH
-
SledgeHammer01 wrote:
Performant lookup is also desired.
You need actual requirements, not off-the-cuff statements. For example, you could take the above statement and claim that the system must be capable of serving all of that data at the very same time. If so, then the company is going to need to buy an IP backbone company just to deliver it. The reality is that the system has business cases that dictate usage. You need to start with those.
SledgeHammer01 wrote:
Needs to be backupable of course as well.
Think about that in terms of the above requirements. How long is it going to take to restore the entire system from scratch? If the answer is obviously too long, then something has to be different.
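The restore question is worth doing the arithmetic on. A quick sanity check, assuming 100 TB of data and a sustained 1 GB/s from the backup media (both figures assumed, not from the thread):

```python
# Restore-time sanity check (throughput figure assumed):
data_tb = 100
throughput_gb_s = 1.0                      # sustained GB/s from backup media
hours = data_tb * 1024 / throughput_gb_s / 3600
print(f"{hours:.1f} hours")                # about 28.4 hours
```

Over a day of downtime for a single-stream restore is exactly the kind of number that forces a different design, such as restoring in parallel or keeping redundant copies live.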
SledgeHammer01 wrote:
I guess at a high level, they could be split by state.
You start with requirements, business cases, and usage patterns, and then define categorizations. Categorizations will impact how the data is stored. One final suggestion: if that estimate is a pie-in-the-sky dream, then go do your own research. If it is a hard business requirement, then the business needs to pay for consultants (very likely plural) with experience in very large data systems rather than trying to roll its own.
-
SQL is a language. I have a few hundred gigs from which I generate Google Maps tile images on the fly, so it can work. Appropriate indexing and caching is the real key. How is it usually handled? Testing, scaling, testing, and measuring. Are you on an intranet exclusively? Inexpensive servers with 4-port NICs serve as really nice in-house CDNs and scale fairly well. Not on an intranet? Well, there are public CDNs that will host the images; then it is merely about hosting the indexes.
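The caching half of that point can be as simple as memoizing the index lookup so repeated requests never touch the backing store. A sketch (the `tile_location` function and its URL scheme are invented stand-ins for a real DB or CDN lookup):

```python
from functools import lru_cache

# Keep hot index lookups in memory; only cache misses hit the real store.
@lru_cache(maxsize=65536)
def tile_location(zoom, x, y):
    # Stand-in for the real lookup (DB query, CDN URL build, etc.)
    return f"/tiles/{zoom}/{x}/{y}.png"

tile_location(3, 4, 5)
tile_location(3, 4, 5)                   # second call served from the cache
print(tile_location.cache_info().hits)   # → 1
```

Measuring the hit rate (as `cache_info()` does here) is part of the "testing and measuring" loop: a low hit rate means the cache size or key design needs revisiting.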
Need custom software developed? I do custom programming based primarily on MS tools with an emphasis on C# development and consulting. "And they, since they Were not the one dead, turned to their affairs" -- Robert Frost "All users always want Excel" --Ennis Lynch