On Thursday 11 August 2005 20:49, David Howells wrote:
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > To be honest I'm having some trouble following this through logically.
> > I'll read through a few more times and see if that fixes the problem.
> > This seems cluster-related, so I have an interest.
>
> Well, perhaps I can explain the function for which I'm using this page
> flag more clearly. You'll have to excuse me if it's covering stuff you
> don't know, but I want to take it from first principles; plus this
> stuff might well find its way into the kernel docs.
>
> We want to use a relatively fast medium (such as RAM or local disk) to
> speed up repeated accesses to a relatively slow medium (such as NFS,
> NBD, CDROM) by means of caching the results of previous accesses to
> the slow medium on the fast medium.
>
> Now we already do this at one level: RAM. The page cache _is_ such a
> cache, but whilst it's much faster than a disk, it is severely
> restricted in size

Did you just suggest that 16 TB per address_space is too small to cache
NFS pages?

> compared to media such as disks, it's more expensive

It is?

> and its contents generally don't last over power failure or reboots.

When used by ramfs, maybe. But fortunately the page cache has a backing
store API; in fact, that is its raison d'être.
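To be concrete about what I mean by "backing store API": every
address_space already carries an operations vector for moving pages to
and from its backing device. Roughly, abridged from memory - see
include/linux/fs.h for the authoritative definition:

	/*
	 * Abridged sketch of the existing per-address_space backing
	 * store API (2.6-era); several operations omitted.
	 * readpage()/readpages() fill page cache pages from the
	 * backing device; writepage()/commit_write() push them back.
	 */
	struct address_space_operations {
		int (*writepage)(struct page *page,
				 struct writeback_control *wbc);
		int (*readpage)(struct file *file, struct page *page);
		int (*readpages)(struct file *file,
				 struct address_space *mapping,
				 struct list_head *pages,
				 unsigned nr_pages);
		int (*prepare_write)(struct file *file, struct page *page,
				     unsigned from, unsigned to);
		int (*commit_write)(struct file *file, struct page *page,
				    unsigned from, unsigned to);
		/* ... */
	};

A network filesystem that wants a disk cache could, in principle, hook
these same paths rather than bolting a second filesystem interface onto
the side.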
> The major attribute of the page cache is that the CPU can access it
> directly.

You seem to have forgotten about non-resident pages.

> So we want to add another level: local disk. The FS-Cache/CacheFS
> patches permit filesystems such as AFS and NFS to use local disk as a
> cache.

The page cache already lets you do that. I have not yet discerned a
fundamental reason why you need to interface to another filesystem to
implement backing store for an address_space.

> So, assume that NFS is using a local disk cache (it doesn't matter
> whether it's CacheFS, CacheFiles, or something else), and assume a
> process has a file open through NFS.
>
> The process attempts to read from the file. This causes the NFS
> readpage() or readpages() operation to be invoked to load the data
> into the page cache so that the CPU can make use of it.
>
> So the NFS page reading algorithm first consults the disk cache.
> Assume this returns a negative response - NFS will then read from the
> server into the page cache. Under cacheless operation it would then
> unlock the page and the kernel could let userspace play with it, but
> we're dealing with a cache, and so the newly fetched data must be
> stored in the disk cache for future retrieval.
>
> NFS now has three choices:
>
> (1) It could instigate a write to the disk cache and wait for that to
>     complete before unlocking the page and letting userspace see it,
>     but we don't know how long that might take.

Pages are typically unlocked while being written to backing store, e.g.:

   http://lxr.linux.no/source/fs/buffer.c#L1839

What makes NFS special in this regard?

>     CacheFS immediately dispatches a write BIO to get it DMA'd to the
>     disk as soon as possible, but something like CacheFiles is
>     dependent on an underlying filesystem - be it EXT3, ReiserFS, XFS,
>     etc. - to perform the write, and we've no control over that.

That is a problem you are in the process of inventing.

>     Time to unlock: CacheMiss + NetRead + CacheWrite
>     Cache reliable: Yes
>
> (2) It could just unlock the page and let userspace scribble on it
>     whilst simultaneously writing it to the cache. But that means the
>     DMA to the disk may pick up some of userspace's scribblings, and
>     that means you can't trust what's in the cache in the event of a
>     power loss.

I thought I saw a journal in there. Anyway, if the user has asked for a
racy write, that is what they should get.

>     This can be alleviated by marking untrustworthy files in the
>     cache, but that then extends the management time in several ways.
>
>     Time to unlock: CacheMiss + NetRead
>     Cache reliable: No

I think your definition of trustworthy goes beyond what is required by
POSIX or Linux local filesystem semantics.

> (3) It could tell the cache that the page needs writing to disk and
>     then unlock it for userspace to read, but intercept the change of
>     a PTE pointing to this page when it loses its write protection
>     (PTEs start off read-only, generating a write protection fault on
>     the first write).

We need to do something like this to implement cross-node caching of
shared-writable mmaps. This is another reason that your ideas need clear
explanations: we need to go the rest of the way and get this sorted out
for cluster filesystems in general, not just NFS (v4). It does help a
lot that you are attempting to explain what the needs of NFS actually
are. Unfortunately, it seems you are proposing that this mechanism is
essential even for single-node use, which is far from clear.

>     The interceptor would then force userspace to wait for the cache
>     to finish DMA'ing the page before writing to it.
>
>     Similarly, the write() or prepare_write() operations would wait
>     for the cache to finish with that page.

Here you return to the assumption that the VFS should enforce per-page
write granularity. There is no such rule as far as I know.

>     Time to unlock: CacheMiss + NetRead
>     Cache reliable: Yes
>
> I originally chose option (1), but then I saw just how much it
> affected performance and worked on option (3).
>
> I discarded option (2) because I want to be able to have some surety
> about the state in the cache - I don't want to have to reinitialise it
> after a power failure. Imagine if you cache /usr... Imagine if
> everyone in a very large office caches /usr...
>
> So, the way I implemented (3) is to use an extra page flag to indicate
> a write underway to the cache, and thus allow cache write status to be
> determined when someone wants to scribble on a page.
>
> The fscache_write_page() function takes a pointer to a callback
> function. In NFS this function clears the PG_fs_misc bit on the
> appropriate pages and wakes up anyone who was waiting for this event
> (end_page_fs_misc()).
>
> The NFS page_mkwrite() VMA op calls wait_on_page_fs_misc() to wait on
> that page bit if it is set.
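So, if I follow, the two halves of the scheme pair up something like
this. This is a sketch of my reading, using the names you gave; the
signatures and bodies are my guesses, not code quoted from your patch:

	/*
	 * Writer side (my sketch): the first write fault on a
	 * shared-writable mapping blocks until any in-flight cache
	 * write has completed.
	 */
	static int nfs_page_mkwrite(struct vm_area_struct *vma,
				    struct page *page)
	{
		/* Returns at once if PG_fs_misc is already clear */
		wait_on_page_fs_misc(page);
		return 0;	/* PTE may now be made writable */
	}

	/*
	 * Completion side (my sketch, invented signature): the
	 * callback handed to fscache_write_page() runs once the disk
	 * cache has finished DMA'ing the page; it clears PG_fs_misc
	 * and wakes anyone sleeping in wait_on_page_fs_misc().
	 */
	static void nfs_cache_write_done(struct page *page, void *context,
					 int error)
	{
		end_page_fs_misc(page);
	}

If that is the shape of it, then whether userspace may legitimately see
the page while PG_fs_misc is set is precisely the write atomicity
question I raise below.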
> > Who is using this interface?
>
> AFS and NFS will both use it. There may be others eventually who use
> it for the same purpose. CacheFS has a different use for it
> internally.

Let's try to clear up the page write atomicity question, please. It
seems your argument depends on it.

Regards,

Daniel