On Friday 26 October 2007 10:55, Christoph Lameter wrote:
> On Fri, 26 Oct 2007, Pavel Machek wrote:
> > > And, _no_, it does not necessarily mean global serialisation. By
> > > simply saying there must be N pages available I say nothing about
> > > on which node they should be available, and the way the
> > > watermarks work they will be evenly distributed over the
> > > appropriate zones.
> >
> > Agreed. Scalability of emergency swapping reserved is simply
> > unimportant. Please, lets get swapping to _work_ first, then we can
> > make it faster.
>
> Global reserve means that any cpuset that runs out of memory may
> exhaust the global reserve and thereby impact the rest of the system.
> The emergencies that are currently localized to a subset of the
> system and may lead to the failure of a job may now become global and
> lead to the failure of all jobs running on it.

If it does, it is a bug in the reserve accounting. That said, I still agree with you that a per-node reserve is a desirable goal for NUMA. I would just like to be clear that it is not necessary, even for NUMA, just nice. By all means somebody should be hacking on a NUMA feature for per-node emergency reserves, but as far as fixing the immediate, serious kernel block IO deadlocks goes, it does not matter.

Pavel, I do not agree that efficiency is unimportant on the under-pressure path. I do not even like to call that the "emergency" path, because under heavy load it is normal for a machine to spend a significant fraction of its time in that state. However, the efficiency goal there does not need to be quite the same as in normal mode.

To illustrate, I would expect to see something like 95% of normal block IO performance on a NUMA machine in the case that "emergency" (aka memalloc) memory is allocated globally instead of locally, thus paying a penalty (modest compared to the disk transfer itself) for transfer of disk data over the NUMA interconnect. 95% of normal throughput on the block IO path is not a problem: if the machine spends 5% of its time on the "emergency" (aka memalloc) path, then overall efficiency will be 95% * 100% + 5% * 95% = 99.75%.

Moral of this story: let's get the memory recursion fixes done in the most obviously correct way and not get distracted by illusory efficiency requirements for NUMA that do not have a big bottom line impact.

I'm glad to see everybody still interested in these problems. Though we have been a little quiet on this issue over here for a while, it does not mean that progress has stopped. In fact, we are testing our solutions more heavily than ever, and getting closer to a solution that not only works solidly, but that should enable mass deletion of the whole creaky notion of dirty page limits in favor of nice, tight per-device control of in-flight write traffic, as I have described previously.

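For anyone who wants to check the 99.75% figure or plug in their own numbers, here is a minimal sketch of the estimate as a time-weighted average of throughput. The function name and parameters are hypothetical, made up for illustration, and the 5% and 95% inputs are the guesses from the paragraph above, not measurements.

```python
def overall_efficiency(pressure_fraction, pressure_throughput, normal_throughput=1.0):
    """Time-weighted average throughput, relative to normal mode.

    pressure_fraction:   share of time spent on the memalloc ("emergency") path
    pressure_throughput: relative block IO throughput while on that path
    normal_throughput:   relative throughput the rest of the time
    """
    if not 0.0 <= pressure_fraction <= 1.0:
        raise ValueError("pressure_fraction must be between 0 and 1")
    normal_share = 1.0 - pressure_fraction
    return normal_share * normal_throughput + pressure_fraction * pressure_throughput


# Illustrative inputs: 5% of time under pressure, at 95% of normal throughput.
print(f"{overall_efficiency(0.05, 0.95):.2%}")
```

Even with a much more pessimistic guess, say 20% of the time under pressure, the same formula gives 99%, which is the point: the penalty only applies to the fraction of time spent on that path.
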
Regards,

Daniel