Well, this has turned into a rather sticky little problem. I've
spent all day going through the vnode/name-cache reclaim code, looking
both at Seigo's cache_purgeleafdirs() and my own patch.
This is what is going on: the old code refused to reuse any vnode that
(A) had cached VM pages associated with it, and it also refused to
reuse (B) directory vnodes residing in the namei cache that contained
any subdirectory or file. (B) does not apply to file vnodes since they
obviously cannot have subdirectories or files 'under' them in the namei
cache. The problem is that when you take the union of (A) and (B),
just about every directory vnode in the system winds up being immune
from reclamation. Thus the number of directory vnodes appears to grow
forever... so it isn't just the fact that most directories are small
that is the problem, it's the presence of (B). This is why small files
don't cause the same problem (or at least do not cause it to the same
degree).
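To make the two conditions concrete, here is a small illustrative
sketch in C. None of this is the actual kernel code; the struct and
field names are made up purely to show the shape of the old test:

#include <stdbool.h>

/* Illustrative stand-in for a vnode; struct and field names invented. */
struct fake_vnode {
    bool has_cached_vm_pages;   /* condition (A) */
    bool is_directory;
    int  namecache_children;    /* subdirs/files 'under' it in the namei cache */
};

/*
 * Old policy: a vnode is exempt from reuse if it has cached VM pages (A),
 * or if it is a directory with any children in the namei cache (B).
 * The set of exempt vnodes is the union of (A) and (B), which on a busy
 * system covers just about every directory vnode.
 */
bool
old_policy_exempt(const struct fake_vnode *vp)
{
    if (vp->has_cached_vm_pages)                            /* (A) */
        return true;
    if (vp->is_directory && vp->namecache_children > 0)     /* (B) */
        return true;
    return false;
}
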
Both Seigo's cache_purgeleafdirs() and my simpler patch simply remove
the (B) requirement, making directory reclamation work approximately
the same as file reclamation. The only difference between Seigo's
patch and mine is that Seigo's makes an effort to remove directories
intelligently... it tries to avoid removing higher level directories.
My patch doesn't make a distinction but assumes that (A) will tend to
hold for higher level directories: that is, that higher level directories
tend to be accessed more often and thus will tend to have pages in the
VM Page Cache, and thus not be candidates for reuse anyway. So my patch
has a very similar effect but without the overhead.
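Continuing the same sketch, both patches in effect reduce the test to
(A) alone: a directory with no resident pages becomes a reclaim
candidate even if the namei cache still has entries under it, while
frequently-used upper-level directories usually remain protected by
(A) anyway.

/*
 * New policy (the effect of either patch, in this simplified model):
 * only condition (A) protects a vnode; directories are treated the
 * same as files.  Higher-level directories are usually accessed often
 * enough to keep resident pages, so (A) tends to keep them around.
 */
bool
new_policy_exempt(const struct fake_vnode *vp)
{
    return vp->has_cached_vm_pages;     /* (A) only; (B) dropped */
}
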
In all the testing I've done I cannot perceive any performance difference
between Seigo's patch and mine, but from an algorithmic point of view
mine ought to scale much, much better. Even if we adjust
cache_purgeleafdirs() to run less frequently, we still run up against
the fact that its scanning algorithm is O(N*M), and we know from
history that that kind of scan can create serious breakage.
People may recall that we had similar problems with the VM Pageout
daemon, where under certain load conditions the pageout daemon wound
up running continuously, eating enormous amounts of cpu. We lived with
the problem for years because the scaling issues didn't rear their
heads until machines got hefty enough to have enough pages for the
algorithms to break down.
People may also recall that we had similar problems with the buffer
cache code.... specifically, the scan 'restart' conditions could
break down algorithmically and result in massive cpu use by bufdaemon.
I think cache_purgeleafdirs() had the right idea. From my experience
with the VM system, however, I have to recommend that we remove it
from the system and, at least initially, replace it with my simpler
patch. We could extend my patch to do the same check -- that is, only
remove directory vnodes at lower levels in the namei cache, simply
by scanning the namei cache list at the vnode in question. So in fact
it would be possible to adjust my patch to have the same effect that
cache_purgeleafdirs() had, but without the scaling issue (or at least
with less of an issue... it would be O(N) rather than O(N*M)).
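A rough sketch of that extension (again with invented structure and
field names, not the real namecache code): rather than scanning the
entire namei cache, look only at the cache entries hanging directly
off the candidate vnode and skip it if any of its children is itself a
directory, i.e. only leaf directories get reclaimed. The per-vnode
check is bounded by that vnode's own child list, so a pass over N
vnodes stays roughly O(N).

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical namei cache entry naming a child of some directory vnode. */
struct fake_ncentry {
    struct fake_ncentry *nc_next;   /* next child of the same directory */
    bool                 nc_isdir;  /* the child is itself a directory */
};

/* Hypothetical directory vnode with a list of its namei cache children. */
struct fake_dirvnode {
    struct fake_ncentry *v_children;
};

/*
 * Return true if the directory still has directory children in the namei
 * cache, i.e. it is not a leaf.  The reclaim path would skip such vnodes,
 * giving roughly the cache_purgeleafdirs() behavior without a full scan
 * of the namei cache for every candidate.
 */
bool
dir_is_nonleaf(const struct fake_dirvnode *dvp)
{
    const struct fake_ncentry *ncp;

    for (ncp = dvp->v_children; ncp != NULL; ncp = ncp->nc_next)
        if (ncp->nc_isdir)
            return true;
    return false;
}
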
-
The bigger problem is exactly as DG has stated... it isn't the namei
cache that is our enemy, it's the VM Page cache preventing vnodes
from being recycled.
For the moment I believe that cache_purgeleafdirs() or my patch solves
the problem well enough that we can run with it for a while. The real
solution, I believe, is to give us the ability to take cached VM Pages
associated with a file and rename them to cached VM Pages associated
with the filesystem device - we can do this only for page-aligned
blocks of course, not fragments (which we would simply throw away)...
but it would allow us to reclaim vnodes independent of the VM Page cache
without losing the cached pages. I think this is doable but it will
require a considerable amount of work. It isn't something I can do in a
day. I also believe that this can dovetail quite nicely into the I/O
model that we have slowly been moving towards over the last year
(Poul's work). Inevitably we will have to manage device-based I/O
on a page-by-page basis and being able to do it via a VM Object seems
to fit the bill in my opinion.
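As a very rough sketch of what that renaming might look like (every
type and helper here is hypothetical; the real work would live in the
VM object layer and would need the filesystem's block map, e.g. via
something like VOP_BMAP): on reclaim, walk the file object's resident
pages, and each page that maps to a full page-aligned device block
gets rekeyed into the device's VM object at the corresponding device
offset; fragments are simply dropped.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_PAGE_SIZE 4096

/* Hypothetical, simplified stand-ins for VM pages and VM objects. */
struct fake_page {
    struct fake_page *p_next;
    uint64_t          p_offset;     /* byte offset within its object */
};

struct fake_object {
    struct fake_page *o_resident;   /* list of resident pages */
};

/*
 * Hypothetical block-map lookup: translate a file offset into a device
 * offset.  A real version would consult the filesystem's block map and
 * fail for holes and fragments.  Here we just model a file laid out
 * contiguously on the device starting at 'file_base'.
 */
static bool
map_file_to_device(uint64_t file_off, uint64_t file_base, uint64_t file_size,
    uint64_t *dev_off)
{
    if (file_off + FAKE_PAGE_SIZE > file_size)  /* fragment or past EOF */
        return false;
    *dev_off = file_base + file_off;
    return true;
}

/*
 * Sketch of the idea: when reclaiming a file vnode, migrate its cached
 * pages into the device's VM object (keyed by device offset) instead of
 * throwing them away.  Pages that do not cover a full page-aligned block
 * are dropped, as fragments would be.
 */
void
migrate_pages_to_device(struct fake_object *file_obj,
    struct fake_object *dev_obj, uint64_t file_base, uint64_t file_size)
{
    struct fake_page *pg, *next;
    uint64_t dev_off;

    for (pg = file_obj->o_resident; pg != NULL; pg = next) {
        next = pg->p_next;
        if (map_file_to_device(pg->p_offset, file_base, file_size,
            &dev_off)) {
            /* "Rename" the page: rekey it into the device object. */
            pg->p_offset = dev_off;
            pg->p_next = dev_obj->o_resident;
            dev_obj->o_resident = pg;
        } else {
            /* Fragment: in the kernel this page would simply be freed. */
        }
    }
    file_obj->o_resident = NULL;
}
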
-Matt