Michael Stone (12024-08-14):
> The short answer is that the reason it handles small files well is because
> Reiser wanted the filesystem to be used for direct storage of small objects,
> whereas most applications dealing with small objects combine them into a
> larger object which is what is stored in the filesystem. E.g., a database
> like sqlite stores records in a large file which the database software
> manages internally rather than storing each record as a separate file. If
> the database wanted to take advantage of this paradigm and store small
> records in individual files, it would exhibit ridiculously poor performance
> on every other filesystem and OS, and writing a database only for reiserfs
> seemed overly limiting. Remember reiserfs was always a research project, and
> never quite done;
> reiser4 pushed these concepts further (e.g., added various atomic
> transaction modes) but never got merged.

Except the original plan did not hold water, even at the time. The obstacles to using the file system instead of a more advanced format are not just the inefficiency of the storage.

First, there are system calls, and they are expensive. Reading a file takes at least three system calls: open, read, close. That assumes you already have a large enough buffer and the file is small enough to fit in it in a single read. With one record per file, you need three system calls per record. With multiple records per file, you can read thousands of records with the same number of system calls. Or you can use mmap and have all the records available without system calls, but with page faults.

Second, the file system offers only key → value lookup and hierarchical enumeration: you can efficiently get at a file if you know its name, or at a set of files if they are all in one directory. But if you want, for example, the files in a certain interval of time, no luck. You could organize your directories to make the kind of request you make frequently efficient, like having a YYYY/MM/DD/HH hierarchy, but it is made awkward by the very limited API of the file system, and cannot even remotely compete with the indexing abilities of structured formats with multiple records per file.

Third (and last of what I can think of right now), libraries or servers that handle structured data often provide infrastructure to ensure non-trivial consistency in the data. For example, they can automatically delete the sub-records associated with a main record you just deleted. With the file system, you would have to reinvent all that.

Do not get me wrong, I am not a fan at all of “if all you've got is SQL, everything looks like a flat list, even a straightforward tree structure”, but the “just use the file system” people do not even realize the kind of services these tools render.

Regards,

-- 
  Nicolas George
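
To illustrate the second and third points, here is a minimal sketch using Python's standard sqlite3 module. The table names, column names, and sample dates are all made up for the example; it only shows the kind of service meant above (an indexed time-range query and an automatic cascading delete), not how any particular application does it.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with hypothetical records and sub-records."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless asked to, per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE record (
            id      INTEGER PRIMARY KEY,
            created TEXT NOT NULL,      -- ISO 8601 timestamp, sorts as text
            body    TEXT
        );
        -- Secondary index: lookups by time, not just by primary key.
        CREATE INDEX record_created ON record (created);
        CREATE TABLE attachment (
            id        INTEGER PRIMARY KEY,
            record_id INTEGER NOT NULL
                      REFERENCES record (id) ON DELETE CASCADE,
            name      TEXT
        );
    """)
    conn.executemany(
        "INSERT INTO record (id, created, body) VALUES (?, ?, ?)",
        [
            (1, "2024-08-13T09:00:00", "first"),
            (2, "2024-08-14T10:30:00", "second"),
            (3, "2024-08-15T08:15:00", "third"),
        ],
    )
    conn.executemany(
        "INSERT INTO attachment (record_id, name) VALUES (?, ?)",
        [(2, "a.txt"), (2, "b.txt"), (3, "c.txt")],
    )
    conn.commit()
    return conn


def records_between(conn, start, end):
    """Return the ids of records with start <= created < end, oldest first."""
    rows = conn.execute(
        "SELECT id FROM record WHERE created >= ? AND created < ? "
        "ORDER BY created",
        (start, end),
    )
    return [row[0] for row in rows]


def attachment_count(conn):
    """Return the total number of sub-records."""
    return conn.execute("SELECT COUNT(*) FROM attachment").fetchone()[0]


if __name__ == "__main__":
    conn = build_demo()
    # Time-interval query, served by the index on `created`.
    print(records_between(conn, "2024-08-14", "2024-08-16"))
    # Deleting a main record removes its sub-records automatically.
    conn.execute("DELETE FROM record WHERE id = 2")
    conn.commit()
    print(attachment_count(conn))
```

With one file per record, the interval query would mean walking directories and checking each file, and the cascading delete would be code you write and keep correct yourself.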