Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

Daniel Phillips Sat, 27 Oct 2007 16:01:26 -0700

On Friday 26 October 2007 10:55, Christoph Lameter wrote:
> On Fri, 26 Oct 2007, Pavel Machek wrote:
> > > And, _no_, it does not necessarily mean global serialisation. By
> > > simply saying there must be N pages available I say nothing about
> > > on which node they should be available, and the way the
> > > watermarks work they will be evenly distributed over the
> > > appropriate zones.
> >
> > Agreed. Scalability of emergency swapping reserved is simply
> > unimportant. Please, lets get swapping to _work_ first, then we can
> > make it faster.
>
> Global reserve means that any cpuset that runs out of memory may
> exhaust the global reserve and thereby impact the rest of the system.
> The emergencies that are currently localized to a subset of the
> system and may lead to the failure of a job may now become global and
> lead to the failure of all jobs running on it.


If it does, it is a bug in the reserve accounting.  That said, I still 
agree with you that per-node reserve is a desirable goal for numa.  I 
would just like to be clear that it is not necessary, even for numa, 
just nice.  By all means somebody should be hacking on a numa feature 
for per-node emergency reserves, but as far as fixing the immediate, 
serious kernel block IO deadlocks goes, it does not matter.

Pavel, I do not agree that efficiency is unimportant on the 
under-pressure path.  I do not even like to call that the "emergency" 
path, because under heavy load it is normal for a machine to spend a 
significant fraction of its time in that state.  However, the 
efficiency goal there does not need to be quite the same as normal 
mode.

To illustrate, I would expect to see something like 95% of normal block 
IO performance on a numa machine in the case that "emergency" (aka 
memalloc memory) is allocated globally instead of locally, thus paying 
a (modest compared to the disk transfer itself) penalty for transfer of 
disk data over the numa interconnect.  95% of normal throughput on the 
block IO path is not a problem: if the machine spends 5% of its time on 
the "emergency" (aka memalloc) path, then overall efficiency will be 
95% * 95% = 99.75%.

Moral of this story: let's get the memory recursion fixes done in the 
most obviously correct way and not get distracted by illusory 
efficiency requirements for numa, that do not have a big bottom line 
impact.

I'm glad to see everybody still interested in these problems.  Though we 
have been a little quiet on this issue over here for a while, it does 
not mean that progress has stopped.  In fact, we are testing our 
solutions more heavily than ever, and getting closer to a solution that 
not only works solidly, but that should enable mass deletion of the 
whole creaky notion of dirty page limits in favor of nice, tight 
per-device control of in flight write traffic as I have described 
previously.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

Reply via email to