Storing huge numbers of files
-
This is about file systems in general, although with a primary emphasis on NTFS: If you are expecting to store a huge number of files - on the order of 100k or more - on a disk, is there any significant advantage to spreading them over a number of subdirectories (based on some sort of hash)? Or are modern file systems capable of handling a huge number of files in a single-level directory? If there are reasons to distribute the files over a series of subdirectories, what are the reasons/explanations for why it would be an advantage? Does this differ, e.g., among different FAT variants, and with NTFS?
A few years ago I worked on a system that generated around 50,000 to 100,000 files a day. We ran into trouble right away. Storing the files was not a problem, but retrieving them was impossible. A second problem was that we needed to search the contents of the files to find all files containing a certain string. We eventually chose to store all files in a database. This was quite easy because the files were small (less than 10K). We chose an Oracle database because of the CLOB datatype (it allows for indexing and searching). We have had no problems since and now have more than 200 million files.
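A minimal sketch of that approach, assuming Oracle's managed ADO.NET provider (Oracle.ManagedDataAccess.Client); the file_store table and its columns are made up here, not the poster's actual schema:

using System.IO;
using Oracle.ManagedDataAccess.Client;

static void StoreFile(string connectionString, string path)
{
    using (var conn = new OracleConnection(connectionString))
    using (var cmd = conn.CreateCommand())
    {
        conn.Open();
        cmd.BindByName = true;
        // The body column is a CLOB, so the text can be indexed and searched inside the database.
        cmd.CommandText = "INSERT INTO file_store (file_name, body) VALUES (:name, :body)";
        cmd.Parameters.Add("name", OracleDbType.Varchar2).Value = Path.GetFileName(path);
        cmd.Parameters.Add("body", OracleDbType.Clob).Value = File.ReadAllText(path);
        cmd.ExecuteNonQuery();
    }
}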
-
This is about file systems in general, although with a primary emphasis on NTFS: If you are expecting to store a huge number of files - on the order of 100k or more - on a disk, is there any significant advantage to spreading them over a number of subdirectories (based on some sort of hash)? Or are modern file systems capable of handling a huge number of files in a single-level directory? If there are reasons to distribute the files over a series of subdirectories, what are the reasons/explanations for why it would be an advantage? Does this differ, e.g., among different FAT variants, and with NTFS?
I worked on a system that had to stream 1MB images to disk at 75fps. I found that once there were about 700 files in a directory, creating new files suddenly became slower and the required transfer rate was unachievable. I ended up creating a new subdirectory every 500 files. Of course this won't be a problem if your system is purely for archive.
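For what it's worth, a minimal sketch of that rolling-subdirectory idea; the 500-file threshold and the naming scheme are placeholders to be tuned:

using System.IO;

// Returns (and creates, if necessary) the subdirectory for a given frame number,
// so no single directory ever holds more than FilesPerDirectory files.
static string DirectoryForFrame(string root, int frameNumber)
{
    const int FilesPerDirectory = 500;
    string dir = Path.Combine(root, (frameNumber / FilesPerDirectory).ToString("D6"));
    Directory.CreateDirectory(dir);   // no-op if the directory already exists
    return dir;
}

// usage:
// File.WriteAllBytes(Path.Combine(DirectoryForFrame(root, frame), frame.ToString("D9") + ".raw"), imageBytes);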
-
Windows Explorer will be your bottleneck ... you sit and wait while it "builds" a 100k tree view. Odds are, it will "hang". "Reading" directories is not a big deal; how you "display" them is.
-
Can you provide an explanation of why it would be that way? Or is it at the "gut feeling" level?
Probably because a Windows folder isn't designed to contain 10,000 files, unlike a database table, which is expected to contain millions of rows (or a spreadsheet). When we browse into a folder using Windows Explorer, it tries to read all the file names inside that folder; there's no virtualization or partial loading. Reading 10,000 file names and extensions is surely detrimental. EDIT: it's probably fine as long as you don't browse it using any Explorer view.
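In your own code, at least, you can avoid materializing the whole listing at once; a small sketch of the difference (the directory path is hypothetical):

using System.IO;
using System.Linq;

// Directory.GetFiles builds the complete array of names before returning.
string[] all = Directory.GetFiles(@"D:\incoming");

// Directory.EnumerateFiles streams names lazily, so you can stop early or
// process them one at a time without holding the whole list in memory.
foreach (string file in Directory.EnumerateFiles(@"D:\incoming", "*.txt").Take(100))
{
    // process file
}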
-
Can you provide an explanation of why it would be that way? Or is it at the "gut feeling" level?
-
Why would the size of the files matter? Very few are small enough to fit in the available space of the directory entry. Yes, they are files, by definition. Mostly, new files are added to the directory; that is the most common operation. Reading the files back is far less frequent.
-
File access; so, mostly reading files? A database would give you the most flexibility and performance. --edit: You can easily expand SQL Server over multiple servers if need be, with more control over sharding and backups than with a regular filesystem.
-
People don't relate well to numbers, and this is a place where camaraderie is important. A name - even an obvious alias - will make the interactions more personable. ;)
-
This is about file systems in general, although with a primary emphasis on NTFS: If you are expecting to store a huge number of files - on the order of 100k or more - on a disk, is there any significant advantage to spreading them over a number of subdirectories (based on some sort of hash)? Or are modern file systems capable of handling a huge number of files in a single-level directory? If there are reasons to distribute the files over a series of subdirectories, what are the reasons/explanations for why it would be an advantage? Does this differ, e.g., among different FAT variants, and with NTFS?
I don't know about access issues for a large number of files in a directory, but you might also consider security. If, for example, you have several different users whose files should not be accessible by the others, creating a subfolder for each user would let you secure them so that only that user (plus perhaps some 'admin' account that your application uses, which can see all directories) has access to their subfolder. There are obvious organizational advantages as well.
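A rough sketch of that per-user folder idea using the Windows ACL types in System.Security.AccessControl (the account name and root path are made up; on .NET Core and later the GetAccessControl/SetAccessControl calls come from the System.IO.FileSystem.AccessControl package):

using System.IO;
using System.Security.AccessControl;

static void CreateUserFolder(string root, string userAccount)
{
    DirectoryInfo dir = Directory.CreateDirectory(Path.Combine(root, userAccount));

    DirectorySecurity security = dir.GetAccessControl();
    // Give the user full control over their own subfolder, inherited by everything inside it.
    security.AddAccessRule(new FileSystemAccessRule(
        userAccount,
        FileSystemRights.FullControl,
        InheritanceFlags.ContainerInherit | InheritanceFlags.ObjectInherit,
        PropagationFlags.None,
        AccessControlType.Allow));
    dir.SetAccessControl(security);
}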
-
Can you provide an explanation of why it would be that way? Or is it at the "gut feeling" level?
-
Too lazy to use italics. "Builds": iterating and instantiating. "Hangs": no response, or exceeding an acceptable response time. "Reading": file I/O. "Display": where one loads a visual element for each file object. Better?
-
If users will be copying these files to a USB stick for any reason, you may run into a problem, as formatting a stick using FAT32 is a distinct possibility.
-
Dave Kreskowiak: You could always format the USB stick with NTFS.
-
You could always format the USB stick with NTFS.
You could, but how many users actually read the documentation for your app?
-
Dave Kreskowiak: You could always format the USB stick with NTFS.
Which is not the same as using a database. The Dokan libraries have proven that a DB is very capable as a FS.
-
This is about file systems in general, although with a primary emphasis on NTFS: If you are expecting to store a huge number of files - on the order of 100k or more - on a disk, is there any significant advantage to spreading them over a number of subdirectories (based on some sort of hash)? Or are modern file systems capable of handling a huge number of files in a single-level directory? If there are reasons to distribute the files over a series of subdirectories, what are the reasons/explanations for why it would be an advantage? Does this differ, e.g., among different FAT variants, and with NTFS?
It really depends on your use case for accessing/managing these files. If you're going to be enumerating the files (or portions of them) a lot, then everything in one directory/folder may not be the best; you can at least "chunk up" the enumeration by subfolder if you create those. Also, if you break them up into subfolders in some logical way, then managing those units and/or groupings of files - backups, restores, archiving, deleting - becomes much easier. If you are storing the path to each file in a database, then you're going to get the same lookup performance either way (subdirectories or everything in one pool together). Can you explain a little more about the repository and how you'll be using it?
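For example, chunking the enumeration by subfolder is straightforward; the repository path and the per-file work here are placeholders:

using System.IO;

// Process one subdirectory's worth of files at a time instead of
// enumerating the entire repository in a single pass.
foreach (string subdir in Directory.EnumerateDirectories(@"D:\repository"))
{
    foreach (string file in Directory.EnumerateFiles(subdir))
    {
        // back up / archive / delete / index this file
    }
}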
-
This is about file systems in general, although with a primary emphasis on NTFS: If you are expecting to store a huge number of files - on the order of 100k or more - on a disk, is there any significant advantage to spreading them over a number of subdirectories (based on some sort of hash)? Or are modern file systems capable of handling a huge number of files in a single-level directory? If there are reasons to distribute the files over a series of subdirectories, what are the reasons/explanations for why it would be an advantage? Does this differ, e.g., among different FAT variants, and with NTFS?
You might get by if you are using SSDs, or if the files are large, accessed directly and infrequently, and won't increase by orders of magnitude. Better to spread them out.

Huge directories in NTFS:
* Accessing individual files is OK.
* Adding/removing/listing/sorting gets slow (consider EnumerateFiles instead of GetFiles).
* Reading metadata (e.g. modification date) is slow, which makes Explorer's detail view slow.
* Network access is slower.
* Defragging directories (with contig) helps some (as does moving large directories with robocopy /create).

Directories (and empty/tiny files) are stored in the MFT, and a massive number of MFT entries can be a problem. The MFT's starting size is set when (and only when) you format the disk (controlled by a registry key). It will expand if needed (but fragment), and will contract (if possible) when space is low. Defragging the MFT is possible but slow and difficult. After a disk has been full of files, or has had the MFT filled by directories or tiny files, it may be best to reformat.

How to segment depends on how sparse the file IDs will be. About 4k entries per directory is a good starting target. If the files have numeric IDs: avoid bit shifts, for simplicity. Group by 3 digits (base 10), giving 1000 files + 1000 subdirs per level, or by 3 hex characters (0xFFF), giving up to 4k files + 4k subdirs per level. For example:

000/0.dat - 999.dat
001/1000.dat - 1999.dat
999/999000.dat - 999999.dat
...
000/001/1000000.dat - 1000999.dat
001/001/1001000.dat - 1001999.dat
123/987/987123000.dat - 987123999.dat
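A minimal C# sketch of the decimal grouping above; the .dat extension and the two-level cutoff at one million are taken from the example, and IDs of a billion or more would need another level:

using System.IO;

static string PathForId(string root, long id)
{
    long thousands = (id / 1_000) % 1_000;   // e.g. 987123456 -> 123
    long millions  = id / 1_000_000;         // e.g. 987123456 -> 987

    string dir = millions == 0
        ? Path.Combine(root, thousands.ToString("D3"))                            // 999/999000.dat
        : Path.Combine(root, thousands.ToString("D3"), millions.ToString("D3"));  // 123/987/987123000.dat

    return Path.Combine(dir, id + ".dat");
}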
-
It's from writing too many User Manuals. As in: the CD disk "tray" is not a "cup holder." Glad to know you and your users are more sophisticated and have time to sweat this stuff.
-
Why would the size of the files matter? Very few are small enough to fit in the available space of the directory entry. Yes, they are files, by definition. Mostly, new files are added to the directory; that is the most common operation. Reading the files back is far less frequent.
File size is critically important. If a file spills just past a block boundary, the rest of that last block is dead space. Assuming a 4K block size and files storing 1K of data, that's 3K of wasted space on disk per file. If you zip up the files, they'll store much, much more efficiently. We have this problem with hundreds of thousands of small text files: we sweep them up and zip them into archive folders on occasion to clean up the folders and reclaim disk space.
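A minimal sketch of that sweep-and-zip step, assuming System.IO.Compression; the folder layout is hypothetical:

using System.IO;
using System.IO.Compression;

static void ArchiveFolder(string folder, string archiveDir)
{
    // folder should not end with a trailing separator, so GetFileName returns its name.
    Directory.CreateDirectory(archiveDir);
    string zipPath = Path.Combine(archiveDir, Path.GetFileName(folder) + ".zip");

    // Packing thousands of sub-4K text files into one archive reclaims the
    // per-file slack space left by the 4K cluster size, and the text compresses well.
    ZipFile.CreateFromDirectory(folder, zipPath);
    Directory.Delete(folder, recursive: true);
}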