On Wed, Sep 26, 2012 at 11:19:42AM -0700, Linus Torvalds wrote:
> I'm *so* not surprised.
>
> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic".

Yeah.

> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*.
>
> In fact, the whole select_idle_sibling() logic makes no sense
> what-so-ever to me. It seems to be total garbage.
>
> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in?"
>
> Please tell me I am mis-reading this?

First of all, I'm so *not* a scheduler guy so take this with a great
pinch of salt.

The way I understand it is, you either want to share L2 with a process,
because, for example, both working sets fit in the L2 and/or there's
some sharing which saves you moving everything over the L3. This is
where selecting a core on the same L2 is actually a good thing.

Or, they're too big to fit into the L2 and they start kicking each-other
out. Then you want to spread them out to different L2s - i.e., different
HT groups in Intel-speak.

Oh, and then there's the userspace spinlocks thingie where Mike's patch
hurts us.

Btw, Mike, you can jump in anytime :-)

So I'd say, this is the hard scheduling problem where fitting the
workload to the architecture doesn't make everyone happy.

A crazy thought: one could go and sample tasks while running their
timeslices with the perf counters to know exactly what type of workload
we're looking at. I.e., do I have a large number of L2 evictions? Yes,
then spread them out. No, then select the other core on the L2. And so
on.
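
A minimal sketch of the sampling heuristic described above, in plain
userspace C. The struct, the enum, the function name and the 1-in-8
eviction threshold are all assumptions made up for illustration; none of
this is kernel code or an existing perf API, and a real version would
have to decide how and when the counters get read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task sample over one timeslice; not a kernel struct. */
struct l2_sample {
	uint64_t l2_accesses;	/* L2 accesses counted during the slice */
	uint64_t l2_evictions;	/* L2 lines evicted during the slice */
};

enum placement {
	PLACE_SHARE_L2,		/* select the other core on the same L2 */
	PLACE_SPREAD_L2,	/* select a core under a different L2 */
};

/* Assumed threshold: more than 1 eviction per 8 accesses means thrashing. */
#define EVICT_RATIO_SHIFT	3

static enum placement pick_placement(const struct l2_sample *s)
{
	/* No data yet: fall back to sharing the L2. */
	if (s->l2_accesses == 0)
		return PLACE_SHARE_L2;

	/* Lots of evictions: the working sets kick each other out. */
	if (s->l2_evictions > (s->l2_accesses >> EVICT_RATIO_SHIFT))
		return PLACE_SPREAD_L2;

	return PLACE_SHARE_L2;
}
```

The threshold is the part that would need real measurements; the point
is only that the share-vs-spread choice becomes a per-task decision
instead of a fixed policy.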

> But starting from the biggest ("llc" group) is wrong *anyway*, since
> it means that it starts looking at the L3 level, and then if it
> finds an acceptable cpu inside that level, it's all done. But that's
> *crazy*. Once again, it's much better to try to find an idle sibling
> *closeby* rather than at the L3 level. No?

Exactly my thoughts a couple of days ago but see above.

> So once again, we should start at the inner level and if we can't find
> something really close, we work our way out, rather than starting from
> the outer level and working our way in.
>
> If I read the code correctly, we can have both "prev" and "cpu" in
> the same L2 domain, but because we start looking at the L3 domain, we
> may end up picking another "affine" CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups inner
> loop that maybe I am mis-reading it entirely.
>
> There are other oddities in select_idle_sibling() too, if I read
> things correctly.
>
> For example, it uses "cpu_idle(target)", but if we're actively trying
> to move to the current CPU (ie wake_affine() returned true), then
> target is the current cpu, which is certainly *not* going to be idle
> for a sync wakeup. So it should actually check whether it's a sync
> wakeup and the only thing pending is that synchronous waker, no?
>
> Maybe I'm missing something really fundamental, but it all really does
> look very odd to me.
>
> Attached is a totally untested and probably very buggy patch, so
> please consider it a "shouldn't we do something like this instead" RFC
> rather than anything serious. So this RFC patch is more a "ok, the
> patch tries to fix the above oddnesses, please tell me where I went
> wrong" than anything else.
>
> Comments?

Let me look at it tomorrow, on a fresh head. Too late here now.

Thanks.

--
Regards/Gruss,
Boris.