NTFS, win7 and 20 million files

Hardware & Devices | 15 posts | 7 posters

Michael Pauli
#1

(I hope this is the right forum :-) Hi all, I'm about to begin a small project in which I must be able to store and look up as many as 20 million files, in the best possible way. Needless to say, fast. I have already been through:

http://en.wikipedia.org/wiki/NTFS#Limitations
http://www.ntfs.com/ntfs_vs_fat.htm

Now my question: with a production load of around 60,000 files (pictures) per day, each around 300 KB in size, what ratio of files per directory to number of directories gives the best search time? Obviously I won't put all the files in one directory, but spread them over a number of dirs. So what is the best economy for such a thing? It seems hard to find information about this on the web. Thanks in advance. Kind regards,

Michael Pauli


Luc Pattyn
#2 (in reply to #1)

Hi,

1. I tend to limit the number of files per folder to 50 or 100. In my experience the count is not very relevant if you never need to browse the folder with, say, Windows Explorer: when your app knows which file to access, it does not matter. If you can group the files logically (say, by topic), then by all means do so. OTOH, if you have to open the folder in Explorer, especially on a remote computer, things may slow down considerably once the folder holds hundreds of files/folders or more. If so, use a two-stage or three-stage organization; with a maximum of N files per folder, that can hold N*N or N*N*N files.

2. Search what? File content? File names? Partial file names? If file names, then again organize a multi-level folder hierarchy based on what matters most to you (it could be the first and second characters of the file names).

3. Whatever it is you really need, just give it a try. In a matter of minutes a test app could create and store a huge number of files (real or dummy), and you could experiment with the result. :)

PS: I'm sure all this is in the wrong forum; it isn't hardware related, is it?

Luc Pattyn [My Articles] Nil Volentibus Arduum
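
A minimal sketch, in Python, of a hash-sharded variant of the multi-level layout described above; the function names, the use of MD5, and the two-hex-character fanout are illustrative assumptions, not code from the thread. Two levels of 256 folders give 65,536 leaves, which averages about 300 files per folder at 20 million files; a third level would bring that well under 100.

```python
import hashlib
from pathlib import Path

def shard_path(root: Path, filename: str) -> Path:
    """Spread files over a two-level tree keyed on a hash of the name:
    root/ab/cd/filename, i.e. 256 * 256 = 65,536 leaf folders."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return root / digest[:2] / digest[2:4] / filename

def store(root: Path, filename: str, data: bytes) -> Path:
    """Write a file into its shard, creating folders on demand."""
    target = shard_path(root, filename)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target

# Lookup never scans a directory: the path is recomputed from the name.
# e.g. shard_path(Path(r"D:\pictures"), "img_000123.jpg")
```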


Jorgen Andersson
#3 (in reply to #1)

Seriously, use a database instead. You're losing very little storage space and winning so much on the lookup. At least if it's properly indexed.

        List of common misconceptions


jschell
#4 (in reply to #1)

In terms of general design: the following site is nice for articles on exactly what the name suggests: http://highscalability.com/ Here is one that you might find more specifically relevant (there are others there about Flickr as well): http://highscalability.com/flickr-architecture

          Michael Pauli wrote:

          in the best possible way. Needless to say fast.

((8-hour business day) * 3600 seconds/hour) / 60,000 files ≈ 0.48 seconds per request, i.e. roughly 2 requests per second. That by itself doesn't require much of a "fast" lookup, and exactly what the lookup consists of is probably more relevant. A file architecture probably matters more for accessing a file than for looking it up. And once you have it, you must still serve it back to the caller, which is going to be a non-trivial cost. If one uses a direct URL mapping, then there are probably other optimization strategies, such as grouping pics by where they sit on the hard drive, that would have a more measurable impact than optimizing directory size, although I wonder whether it would be significant. I would also expect such strategies to be affected (if measurable at all) by the actual hard drive chosen.
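
Working the thread's numbers through (a sketch; the 8-hour business day is the assumption above, while the 60,000 files/day at ~300 KB each and the 20-million-file target come from the original post):

```python
FILES_PER_DAY = 60_000
FILE_SIZE_KB = 300
BUSINESS_SECONDS = 8 * 3600                                # 28,800 s

seconds_per_request = BUSINESS_SECONDS / FILES_PER_DAY     # ~0.48 s between arrivals
requests_per_second = FILES_PER_DAY / BUSINESS_SECONDS     # ~2.08 req/s
daily_growth_gb = FILES_PER_DAY * FILE_SIZE_KB / 1024**2   # ~17.2 GB/day
total_size_tb = 20_000_000 * FILE_SIZE_KB / 1024**3        # ~5.6 TB at capacity

print(f"{seconds_per_request:.2f} s/request, {requests_per_second:.2f} req/s")
print(f"growth ~{daily_growth_gb:.1f} GB/day, full store ~{total_size_tb:.1f} TB")
```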


Michael Pauli
#5 (in reply to #3)

Hi Jörgen, I totally agree with your comment, but my customer wants to use a file system and not an Oracle DB or the like. I really don't understand why, but I'm told it is something about maintenance and backup. Kind regards,

            Michael Pauli


Dave Kreskowiak
#6 (in reply to #5)

Yeah, that's utter bullshit. Your customer is going to find that method non-performant, limited, and very easy to screw up while doing "maintenance". The more files and directories you shove into the directory structure, the slower a single search is going to get. Indexing won't help much as the indexes will be limited to the properties of the files themselves as well as the metadata stored in the image files. Every file and directory you add makes the NTFS data structures grow and grow, eventually taking up gigabytes of space and slowing your machine's boot time, and if something should happen to those tables, God help you when performing a CHKDSK on it. Bring a cot to sleep on. The backup argument is also garbage, as it's just as easy to back up a database as it is to back up the massive pile of debris you're about to litter the drive with.

A guide to posting questions on CodeProject
              Dave Kreskowiak


Jorgen Andersson
#7 (in reply to #6)

                Very nice and clear summary.

                List of common misconceptions


Jorgen Andersson
#8 (in reply to #5)

Maintenance and backup are among the best reasons to use a database. Tell them to educate their staff. Dave's summary is spot on, in my opinion.

                  List of common misconceptions


jschell
#9 (in reply to #6)

                    Dave Kreskowiak wrote:

                    The more files and directories you shove into the directory structure, the slower a single search is going to get. Indexing won't help much as the indexes will be limited to the properties of the files themselves as well as the metadata stored in the image files.

Not sure I understand that. I am rather certain that both MS SQL Server and Oracle provide for a file-based blob storage mechanism. And of course, using a URL string for a blob entry is an option for any database. There are tradeoffs as to whether one wants to keep it in the database or the file system. And it isn't that hard to implement at least some simplistic indexing scheme if one doesn't want to use a database; that requires using another file, it doesn't require searching the files themselves. And if one were using a database, one would still have to export the metadata from the files. If one didn't, I wouldn't be surprised if attempting to extract metadata from image blobs were slower with a database.

                    Dave Kreskowiak wrote:

Every file and directory you add makes the NTFS data structures grow and grow, eventually taking up gigabytes of space and slowing your machine's boot time

                    What does the storage requirement have to do with anything? If you store something in a database it takes space too. I have never heard anyone make that claim about any OS slowing down. Could you provide a reference?
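
A sketch of the "simplistic indexing scheme in another file" idea mentioned above: metadata lives in one sidecar file, so queries never open the images themselves. The file name, format, and fields are illustrative assumptions, not anything from the thread.

```python
import json
from pathlib import Path

INDEX = Path("photo_index.jsonl")  # hypothetical sidecar index, one JSON record per line

def index_add(name: str, path: str, meta: dict) -> None:
    """Append one record as the image is stored; images are never re-read."""
    with INDEX.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"name": name, "path": path, **meta}) + "\n")

def index_load() -> dict:
    """Read the whole index into a name -> record dict. At 20 million records
    this would be heavy; a real system would shard or sort the file instead."""
    with INDEX.open(encoding="utf-8") as f:
        return {rec["name"]: rec for rec in map(json.loads, f)}

index_add("img_000123.jpg", "ab/cd/img_000123.jpg", {"taken": "2011-06-01", "kb": 300})
print(index_load()["img_000123.jpg"]["path"])
```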


Dave Kreskowiak
#10 (in reply to #9)

                      jschell wrote:

                      Not sure I understand that.

                      In order to search 20,000,000 files and have a request return something in your lifetime, you better have the Indexing service turned on and your app better be using it. Check the OP. He's specifically avoiding using a database because of stupid customer requirements.

                      jschell wrote:

                      What does the storage requirement have to do with anything?

The size of the NTFS tables on disk grows and grows with the number of files and folders you stick in the volume. Directory entries take up space on the disk. Not so much if you put everything into a database, since the database is only a few files.

                      jschell wrote:

                      I have never heard anyone make that claim about any OS slowing down. Could you provide a reference?

Don't have to. Think about it. The NTFS tables take up memory. The bigger you make those tables, the more memory is going to be eaten up and the less is available for apps. Of course, what effect this has depends on how much memory is in the machine. I meant to say that the server will take longer and longer to boot, not necessarily that the app slows down once everything is loaded and running. You want documentation? Try it yourself: load up your C: drive with 20,000,000 files in a few thousand folders, reboot your machine, and watch what happens. To take it a bit further, try scheduling a CHKDSK and reboot. Don't forget to have a pot of coffee standing by.

A guide to posting questions on CodeProject
                      Dave Kreskowiak
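
A throwaway generator along the lines of the "try it yourself" experiment above (a sketch: the path, counts, and layout are illustrative, and millions of files will take hours and eat serious MFT space, so start small and not on a drive you care about):

```python
from pathlib import Path

ROOT = Path(r"D:\ntfs_stress")   # hypothetical scratch volume
TOTAL_FILES = 100_000            # raise toward 20_000_000 at your own risk
FILES_PER_FOLDER = 500

payload = b"\0" * 300 * 1024     # ~300 KB dummy image, matching the OP's sizes

for i in range(TOTAL_FILES):
    folder = ROOT / f"{i // FILES_PER_FOLDER:05d}"
    folder.mkdir(parents=True, exist_ok=True)
    (folder / f"img_{i:08d}.jpg").write_bytes(payload)
    if i % 10_000 == 0:
        print(f"{i} files written")
```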


Michael Pauli
#11 (in reply to #6)

Hi Dave! Thank you for your opinion. I must say I tend to go your way here, but to avoid problems of a more political nature I'll go for the file system solution. In my career I've never done a thing like this, and I find it hard to write, even though it's simplistic by nature. To begin with we'll go for 500 directories, each holding 500 subdirectories, each holding 500 subdirectories; that is 500³ = 125,000,000. I have a server for this, so it's not on my local dev PC. :-) My feeling is that we would be better off with an Oracle DB or the like, but the decision is made :-( Thanks again. Kind regards,

                        Michael Pauli
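
A sketch of how sequential file IDs could be mapped into a 500-wide layout like the one described above, assuming (as the 500³ arithmetic suggests) that the third factor of 500 is files per leaf folder; all names are illustrative:

```python
FANOUT = 500  # proposed width per level

def path_for(file_id: int) -> str:
    """Two directory levels of 500 entries each, 500 files per leaf:
    500 * 500 * 500 = 125,000,000 addressable files."""
    assert 0 <= file_id < FANOUT ** 3
    leaf = file_id // FANOUT          # index of the leaf folder
    top, mid = divmod(leaf, FANOUT)   # split the leaf index across two levels
    return f"{top:03d}/{mid:03d}/img_{file_id:09d}.jpg"

# path_for(20_000_000) -> '080/000/img_020000000.jpg'; the lookup is pure
# arithmetic on the ID, so no directory ever has to be scanned.
```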


Michael Pauli
#12 (in reply to #8)

Yeah, sure, I agree, but some technicians here would like to have this file-based rather than in a database, for more or less obscure reasons. So they get what they want. I have less than a week left on this assignment... if you get my point ;-) Kind regards,

                          Michael Pauli


jschell
#13 (in reply to #10)

                            Dave Kreskowiak wrote:

                            In order to search 20,000,000 files and have a request return something in your lifetime, you better have the Indexing service turned on and your app better be using it.

Searching the image data of 20 million blobs in a database is going to take just as long, and probably longer. The only way to avoid that in the database is to extract the metadata from the images and store it somewhere else in the database. And again, one can do exactly the same thing with a file-based system.

                            Dave Kreskowiak wrote:

The size of the NTFS tables on disk grows and grows with the number of files and folders you stick in the volume. Directory entries take up space on the disk.

                            The size of the database on the disk grows with the number of blobs you stick in it. So how exactly is that different?

                            Dave Kreskowiak wrote:

Don't have to. Think about it. The NTFS tables take up memory. The bigger you make those tables, the more memory is going to be eaten up and the less is available for apps. Of course, what effect this has depends on how much memory is in the machine.

That isn't how any modern file system works; it doesn't load the entire file system into memory. As a matter of fact, the database is going to load more into memory than the file system will, quite a bit more unless you constrain it. Not that it would matter anyway, since it would be using virtual memory.

                            Dave Kreskowiak wrote:

                            I meant to say that the server will take longer and longer to boot

That clarifies it for me: I don't believe it. Please provide a reference, specifically one that refers to booting the machine. (Since I was interested, I also determined that I have over 500,000 files on my personal development computer. If there were in fact some impact, I would certainly expect a server-class machine with a server-class file system to handle more files than a personal dev box.)


Lost User
#14 (in reply to #3)

                              Jörgen Andersson wrote:

                              You're losing very little storage space and winning so much on the lookup.

+5. There are probably few worries about fragmentation, since the pictures never change, and it'd be the fastest solution for retrieving a blob :thumbsup:

                              Bastard Programmer from Hell :suss:


puromtec1
#15 (in reply to #1)

Maybe a document repository tool would be a good solution. Here is one that I use extensively, although only the free version so far; you would need the full version to allow more concurrent users and an unlimited document count: http://www.m-files.com/eng/home.asp The repository offers classification of files and fast searching, and pretty much considers the "physical" location irrelevant; it treats all documents as being in the "bag". Also, a big plus: it doesn't use Windows' network mapping (or whatever its formal name is); it works over TCP instead (IMGIC).
