On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov <b...@alien8.de> wrote: > On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote: >> How does pgbench look? That's the one that apparently really wants to >> spread out, possibly due to user-level spinlocks. So I assume it will >> show the reverse pattern, with "kill select_idle_sibling" being the >> worst case. Sad, because it really would be lovely to just remove that >> thing ;) > > Yep, correct. It hurts.
I'm *so* not surprised. That said, I think your "kill select_idle_sibling()" one was interesting, but the wrong kind of "get rid of that logic". It always selected target_cpu, but the fact is, that doesn't really sound very sane. The target cpu is either the previous cpu or the current cpu, depending on whether they should be balanced or not. But that still doesn't make any *sense*. In fact, the whole select_idle_sibling() logic makes no sense what-so-ever to me. It seems to be total garbage. For example, it starts with the maximum target scheduling domain, and works its way in over the scheduling groups within that domain. What the f*ck is the logic of that kind of crazy thing? It never makes sense to look at a biggest domain first. If you want to be close to something, you want to look at the *smallest* domain first. But because it looks at things in the wrong order, it then needs to have that inner loop saying "does this group actually cover the cpu I am interested in?" Please tell me I am mis-reading this? But starting from the biggest ("llc" group) is wrong *anyway*, since it means that it starts looking at the L3 level, and then if it finds an acceptable cpu inside that level, it's all done. But that's *crazy*. Once again, it's much better to try to find an idle sibling *closeby* rather than at the L3 level. No? So once again, we should start at the inner level and if we can't find something really close, we work our way out, rather than starting from the outer level and working our way in. If I read the code correctly, we can have both "prev" and "cpu" in the same L2 domain, but because we start looking at the L3 domain, we may end up picking another "affine" CPU that isn't even sharing L2's *before* we pick one that actually *is* sharing L2's with the target CPU. But that code is confusing enough with the scheduler groups inner loop that maybe I am mis-reading it entirely. There are other oddities in select_idle_sibling() too, if I read things correctly. For example, it uses "cpu_idle(target)", but if we're actively trying to move to the current CPU (ie wake_affine() returned true), then target is the current cpu, which is certainly *not* going to be idle for a sync wakeup. So it should actually check whether it's a sync wakeup and the only thing pending is that synchronous waker, no? Maybe I'm missing something really fundamental, but it all really does look very odd to me. Attached is a totally untested and probably very buggy patch, so please consider it a "shouldn't we do something like this instead" RFC rather than anything serious. So this RFC patch is more a "ok, the patch tries to fix the above oddnesses, please tell me where I went wrong" than anything else. Comments? Linus
patch.diff
Description: Binary data