
This might be a silly thing to ask, but why don't they save their data in flat files on a shared filesystem?

They wouldn't need Memcached, since the OS already caches file data; replication with rsync becomes easy; they don't need transactions anyway; and they'd have much less software to manage.
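For what it's worth, the flat-file idea fits in a few lines. This is only a sketch; the shared-filesystem path and the key naming are made up for illustration, not anything the site actually uses:

    import os

    DATA_DIR = "/mnt/shared/data"   # hypothetical shared-filesystem mount (an assumption)
    os.makedirs(DATA_DIR, exist_ok=True)

    def put(key: str, value: bytes) -> None:
        """Write one record as a flat file; hot reads come out of the OS page cache."""
        path = os.path.join(DATA_DIR, key)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(value)
        os.replace(tmp, path)       # atomic rename so readers never see a half-written file

    def get(key: str) -> bytes:
        with open(os.path.join(DATA_DIR, key), "rb") as f:
            return f.read()

Replication would then just be rsync of DATA_DIR.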




Storing, tracking, and managing billions of tiny files directly on a file system is a nightmare.


What about it is a nightmare?

Isn't storing, tracking, and managing billions of entries done directly in a database anyway?


It's a real pain when you want to inspect, delete, or copy the files.

Try taking 300,000 files and copying them somewhere. Then copy one file that is the size of the 300,000 combined. The single file is MUCH faster (it's also why we usually tar things up before copying them if they're already compressed). Any database that isn't a toy will usually lay the 300,000 records out in a single file (depending on settings, sizes, and filesystem limits).
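A rough way to see this for yourself; the file count and sizes below are arbitrary, and the absolute timings will obviously depend on your filesystem and hardware:

    import os, shutil, tarfile, time

    SRC = "many_files"
    N, SIZE = 300_000, 1024          # 300,000 files of 1 KiB each; lower N for a quick test
    payload = os.urandom(SIZE)

    os.makedirs(SRC, exist_ok=True)
    for i in range(N):
        with open(os.path.join(SRC, f"{i}.dat"), "wb") as f:
            f.write(payload)

    # One-time cost: bundle everything into a single uncompressed tar.
    with tarfile.open("bundle.tar", "w") as tar:
        tar.add(SRC)

    t0 = time.time()
    shutil.copytree(SRC, "copy_of_many")             # copy 300,000 small files
    print("many small files:", time.time() - t0)

    t0 = time.time()
    shutil.copy("bundle.tar", "copy_of_bundle.tar")  # copy one file of the same total size
    print("one big file:    ", time.time() - t0)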

The 300,000 files end up scattered all over the drive, and disk seeks kill you at run-time. This may not be true for an SSD, but I don't have any evidence to suggest it one way or the other.

Even if the physical storage is fine with this, I suspect you'd run into filesystem issues when you spread millions, if not hundreds of millions, of files across a directory tree and then hit it hard.

I have worked with 1,000,000 files before when playing with crawling/indexing, and it becomes a real management pain. It may seem cleaner to lay each record out as a single file, but in the long run, once you reach a large size, the benefits aren't worth it.


In addition,

It doesn't have good querying utilities. You'd have to build your own indexer and query engine. Since you can't put billions of files in a single directory, you'd have to split them into a directory tree. That alone requires some basic indexing functionality and rebalancing tools (in case a single directory grows too large or too small). This is without any more sophisticated querying capabilities like “where X=val” where X isn't the object ID/filename.
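The usual workaround for the directory-size part is to shard on a hash of the object ID. A minimal sketch; the two-level, 256-way fan-out and the root path are arbitrary choices for illustration:

    import hashlib, os

    ROOT = "/mnt/shared/objects"     # hypothetical root of the object store

    def shard_path(object_id: str) -> str:
        """Spread files across a two-level tree, e.g. ab/cd/<id>, so no single
        directory has to hold billions of entries."""
        h = hashlib.sha1(object_id.encode()).hexdigest()
        return os.path.join(ROOT, h[:2], h[2:4], object_id)

    path = shard_path("user:12345")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(b"...")

That only buys you lookup by ID, though; any "where X=val" query still needs an index you'd have to build and maintain yourself.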

Write performance is going to be terrible, since every record means its own file creation and metadata update.

Existing tools for managing, backing up, restoring, performance-tuning, and monitoring aren't suited to handling huge numbers of small files, let alone operating on just a subset of them (given certain criteria related to the data itself).

You could build specialized tools to resolve all of these issues, but after hundreds of man-years you'd just have reinvented some kind of database anyway.



