On Thursday 11 August 2005 20:49, David Howells wrote:
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > To be honest I'm having some trouble following this through logically.
> > I'll read through a few more times and see if that fixes the problem.
> > This seems cluster-related, so I have an interest.
>
> Well, perhaps I can explain the function for which I'm using this page
> flag more clearly. You'll have to excuse me if it's covering stuff you
> don't know, but I want to take it from first principles; plus this
> stuff might well find its way into the kernel docs.
>
> We want to use a relatively fast medium (such as RAM or local disk) to
> speed up repeated accesses to a relatively slow medium (such as NFS,
> NBD, CDROM) by means of caching the results of previous accesses to
> the slow medium on the fast medium.
>
> Now we already do this at one level: RAM. The page cache _is_ such a
> cache, but whilst it's much faster than a disk, it is severely
> restricted in size

Did you just suggest that 16 TB per address_space is too small to cache
NFS pages?

> compared to media such as disks, it's more expensive

It is?

> and its contents generally don't last over power failure or reboots.

When used by ramfs, maybe. But fortunately the page cache has a backing
store API; in fact, that is its raison d'être.
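To be concrete about what I mean by "backing store API": every
address_space already carries an operations vector for moving pages to
and from its backing device. Roughly, abridged from memory - see
include/linux/fs.h for the authoritative definition:

	/*
	 * Abridged sketch of the existing per-address_space backing
	 * store API (2.6-era); several operations omitted.
	 * readpage()/readpages() fill page cache pages from the
	 * backing device; writepage()/commit_write() push them back.
	 */
	struct address_space_operations {
		int (*writepage)(struct page *page,
				 struct writeback_control *wbc);
		int (*readpage)(struct file *file, struct page *page);
		int (*readpages)(struct file *file,
				 struct address_space *mapping,
				 struct list_head *pages,
				 unsigned nr_pages);
		int (*prepare_write)(struct file *file, struct page *page,
				     unsigned from, unsigned to);
		int (*commit_write)(struct file *file, struct page *page,
				    unsigned from, unsigned to);
		/* ... */
	};

A network filesystem that wants a disk cache could, in principle, hook
these same paths rather than bolting a second filesystem interface onto
the side.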
> The major attribute of the page cache is that the CPU can access it
> directly.

You seem to have forgotten about non-resident pages.

> So we want to add another level: local disk. The FS-Cache/CacheFS
> patches permit filesystems such as AFS and NFS to use local disk as a
> cache.

The page cache already lets you do that. I have not yet discerned a
fundamental reason why you need to interface to another filesystem to
implement backing store for an address_space.

> So, assume that NFS is using a local disk cache (it doesn't matter
> whether it's CacheFS, CacheFiles, or something else), and assume a
> process has a file open through NFS.
>
> The process attempts to read from the file. This causes the NFS
> readpage() or readpages() operation to be invoked to load the data
> into the page cache so that the CPU can make use of it.
>
> So the NFS page reading algorithm first consults the disk cache.
> Assume this returns a negative response - NFS will then read from the
> server into the page cache. Under cacheless operation it would then
> unlock the page and the kernel could let userspace play with it, but
> we're dealing with a cache, and so the newly fetched data must be
> stored in the disk cache for future retrieval.
>
> NFS now has three choices:
>
> (1) It could instigate a write to the disk cache and wait for that to
>     complete before unlocking the page and letting userspace see it,
>     but we don't know how long that might take.

Pages are typically unlocked while being written to backing store, e.g.:

   http://lxr.linux.no/source/fs/buffer.c#L1839

What makes NFS special in this regard?

>     CacheFS immediately dispatches a write BIO to get it DMA'd to the
>     disk as soon as possible, but something like CacheFiles is
>     dependent on an underlying filesystem - be it EXT3, ReiserFS, XFS,
>     etc. - to perform the write, and we've no control over that.

That is a problem you are in the process of inventing.

>     Time to unlock: CacheMiss + NetRead + CacheWrite
>     Cache reliable: Yes
>
> (2) It could just unlock the page and let userspace scribble on it
>     whilst simultaneously writing it to the cache. But that means the
>     DMA to the disk may pick up some of userspace's scribblings, and
>     that means you can't trust what's in the cache in the event of a
>     power loss.

I thought I saw a journal in there. Anyway, if the user has asked for a
racy write, that is what they should get.

>     This can be alleviated by marking untrustworthy files in the
>     cache, but that then extends the management time in several ways.
>
>     Time to unlock: CacheMiss + NetRead
>     Cache reliable: No

I think your definition of trustworthy goes beyond what is required by
POSIX or Linux local filesystem semantics.

> (3) It could tell the cache that the page needs writing to disk and
>     then unlock it for userspace to read, but intercept the change of
>     a PTE pointing to this page when it loses its write protection
>     (PTEs start off read-only, generating a write protection fault on
>     the first write).

We need to do something like this to implement cross-node caching of
shared-writable mmaps. This is another reason that your ideas need clear
explanations: we need to go the rest of the way and get this sorted out
for cluster filesystems in general, not just NFS (v4). It does help a
lot that you are attempting to explain what the needs of NFS actually
are. Unfortunately, it seems you are proposing that this mechanism is
essential even for single-node use, which is far from clear.

>     The interceptor would then force userspace to wait for the cache
>     to finish DMA'ing the page before writing to it.
>
>     Similarly, the write() or prepare_write() operations would wait
>     for the cache to finish with that page.

Here you return to the assumption that the VFS should enforce per-page
write granularity. There is no such rule as far as I know.

>     Time to unlock: CacheMiss + NetRead
>     Cache reliable: Yes
>
> I originally chose option (1), but then I saw just how much it
> affected performance and worked on option (3).
>
> I discarded option (2) because I want to be able to have some surety
> about the state in the cache - I don't want to have to reinitialise it
> after a power failure. Imagine if you cache /usr... Imagine if
> everyone in a very large office caches /usr...
>
> So, the way I implemented (3) is to use an extra page flag to indicate
> a write underway to the cache, and thus allow cache write status to be
> determined when someone wants to scribble on a page.
>
> The fscache_write_page() function takes a pointer to a callback
> function. In NFS this function clears the PG_fs_misc bit on the
> appropriate pages and wakes up anyone who was waiting for this event
> (end_page_fs_misc()).
>
> The NFS page_mkwrite() VMA op calls wait_on_page_fs_misc() to wait on
> that page bit if it is set.
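So, if I follow, the two halves of the scheme pair up something like
this. This is a sketch of my reading, using the names you gave; the
signatures and bodies are my guesses, not code quoted from your patch:

	/*
	 * Writer side (my sketch): the first write fault on a
	 * shared-writable mapping blocks until any in-flight cache
	 * write has completed.
	 */
	static int nfs_page_mkwrite(struct vm_area_struct *vma,
				    struct page *page)
	{
		/* Returns at once if PG_fs_misc is already clear */
		wait_on_page_fs_misc(page);
		return 0;	/* PTE may now be made writable */
	}

	/*
	 * Completion side (my sketch, invented signature): the
	 * callback handed to fscache_write_page() runs once the disk
	 * cache has finished DMA'ing the page; it clears PG_fs_misc
	 * and wakes anyone sleeping in wait_on_page_fs_misc().
	 */
	static void nfs_cache_write_done(struct page *page, void *context,
					 int error)
	{
		end_page_fs_misc(page);
	}

If that is the shape of it, then whether userspace may legitimately see
the page while PG_fs_misc is set is precisely the write atomicity
question I raise below.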
> > Who is using this interface?
>
> AFS and NFS will both use it. There may be others eventually who use
> it for the same purpose. CacheFS has a different use for it
> internally.

Let's try to clear up the page write atomicity question, please. It
seems your argument depends on it.

Regards,

Daniel