On Thu, Apr 12, 2007 at 08:32:05AM +1000, Neil Brown wrote:
> For the first:
>  You are storing an internal tree representation of part of the
>  directory attached to the 'struct file'.
>  Would it be possible to store this attached to the page via the
>  ->private pointer?  That would avoid the allocate/create/free
>  overhead on every request.
The reason why we are storing it associated with the file pointer
instead of the page/block is because a filename insertion might cause
a node split, in which case half of the 4k block gets copied to
another block.  We need a stable pointer to where we are in the tree
that can cope with hash collisions, and that's the reason for
creating the red/black tree in the first place, since it *doesn't*
get split and reorganized when the directory's hash tree gets
reorg'ed.  So attaching the tree to the page breaks the reason why we
have the separate data structure in the first place.

> You suggest caching the open files in nfsd.  While that is probably
>  possible (I've thought of it a number of times) it would also be
>  quite complex, e.g. requiring some sort of call-back to close all
>  those files when the filesystem is unexported.  And it is very easy
>  to get caching heuristics wrong.  Leveraging the page-cache, which
>  is a very mature cache, seems to make a lot of sense.

Is it really that complex?  The simplest way of handling it is simply
to keep an open directory fd cache in a per-filesystem rbtree indexed
by file handle, where each entry contains the file pointer.  When you
unexport the filesystem, you simply walk the rbtree and close all of
the file descriptors; no callback is required.  (A rough sketch of
what I have in mind is at the end of this message.)

The caching heuristics are an issue, but a fixed-size cache with a
simple LFU replacement strategy isn't all that complex to implement.
If 95% of the time the readdirs come in quick succession, even a
small cache will probably provide huge performance gains, and
increasing the cache size past some critical point will probably only
provide marginal improvements.

> For the second.
>  You say that you "would need at least 96 bits in order to make that
>  guarantee; 64 bits of hash, plus a 32-bit count value in the hash
>  collision chain".  I think 96 is a bit greedy.  Surely 48 bits of
>  hash and 16 bits of collision-chain-position would be plenty.  You
>  would need 65537 entries before a collision was even possible, and
>  billions before it was at all likely.  (How big does a set of 48bit
>  numbers have to get before the probability that "No subset of 65536
>  numbers are all the same" drops below 0.95?)
>
>  This would really require that the collision-chain-index was stable
>  across create/delete.  Doing that while you have the tree in the
>  page cache is probably easy enough.  Doing it across reboots is
>  probably not possible without on-disk changes.

Actually, no, we can't keep the collision chain count stable across a
create/delete even while the tree is cached.  At least, not without
storing a huge amount of state associated with each page.  (It would
be a lot more work than simply having nfsd keep an fd cache for
directory streams ;-).

If we need create/delete stability, probably our only sane
implementation choice is to just stick with a 63-bit hash, and cross
our fingers and hope for the best.  If nfsd caches the last N used
directory fds, where N is roughly proportional to the number of
active clients, and the clients all only use the last cookie returned
in the readdir entry (since it would be stupid to use one of the
earlier ones and request the server to re-send something which the
client already has), at least in the absence of telldir/seekdir
calls, then that might be quite sufficient, even if we return
multiple directory entries which contain hash collisions to the
client.  As long as the directory fd is cached, and the client just
uses the last cookie to fetch the next batch of dirents, we'll be
fine.
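To make the fd cache concrete, here is a rough, untested sketch; the
dircache_* names and the fixed-size file handle buffer are
illustrative only, and locking plus the LFU eviction policy are
omitted:

#include <linux/types.h>
#include <linux/rbtree.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/string.h>

struct dircache_entry {
	struct rb_node	node;
	u32		fh_size;
	u8		fh_data[64];	/* opaque file handle bytes */
	struct file	*filp;		/* open directory */
};

/* Compare a lookup key against a cached entry. */
static int fh_cmp(u32 size, const u8 *data,
		  const struct dircache_entry *e)
{
	if (size != e->fh_size)
		return size < e->fh_size ? -1 : 1;
	return memcmp(data, e->fh_data, size);
}

/* Look up a cached open directory by file handle. */
static struct file *dircache_lookup(struct rb_root *root,
				    u32 size, const u8 *data)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct dircache_entry *e =
			rb_entry(n, struct dircache_entry, node);
		int cmp = fh_cmp(size, data, e);

		if (cmp < 0)
			n = n->rb_left;
		else if (cmp > 0)
			n = n->rb_right;
		else
			return e->filp;
	}
	return NULL;
}

/* On unexport: walk the whole tree and close every descriptor. */
static void dircache_purge(struct rb_root *root)
{
	struct rb_node *n;

	while ((n = rb_first(root)) != NULL) {
		struct dircache_entry *e =
			rb_entry(n, struct dircache_entry, node);

		rb_erase(n, root);
		filp_close(e->filp, NULL);
		kfree(e);
	}
}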
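And for concreteness on the cookie encoding, here is one way the
hash-plus-chain-position cookie Neil describes might be packed into
the 63 usable bits of a positive loff_t; the 47/16 split below is
purely illustrative, not a proposal for an actual on-disk or
on-the-wire format:

#include <linux/types.h>

#define COOKIE_CHAIN_BITS	16
#define COOKIE_HASH_BITS	47	/* 63 bits total, sign bit clear */

/* Pack the directory hash and the position in its collision chain. */
static inline u64 make_cookie(u64 hash, u16 chain_pos)
{
	return ((hash & ((1ULL << COOKIE_HASH_BITS) - 1))
			<< COOKIE_CHAIN_BITS) | chain_pos;
}

static inline u64 cookie_hash(u64 cookie)
{
	return cookie >> COOKIE_CHAIN_BITS;
}

static inline u16 cookie_chain_pos(u64 cookie)
{
	return cookie & ((1u << COOKIE_CHAIN_BITS) - 1);
}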
						- Ted