* Mike Galbraith <efa...@gmx.de> wrote:

> I think the pgbench problem is more about latency for the 1 in
> 1:N than spinlocks.

So my understanding of the psql workload is that basically we've got a central psql proxy process that is distributing work to worker psql processes. If a freshly woken worker process ever preempts the central proxy process then it is preventing a lot of new work from getting distributed.

Correct?

So the central proxy psql process is 'much more important' to run than any of the worker processes - an importance that is not (currently) visible from the behavioral statistics the scheduler keeps on tasks.

So the scheduler has the following problem here: a new wakee might be starved enough and the proxy might have run long enough to really justify the preemption here and now. The buddy statistics help avoid some of these cases - but not all, and the difference is measurable.

Yet the 'best' way for psql to run is for this proxy process to never be preempted. Your SCHED_BATCH experiments confirmed that.

The way remote CPU selection affects it is that if we ever get more aggressive in selecting a remote CPU then we, as a side effect, also reduce the chance of harmful preemption of the central proxy psql process.

So in that sense sibling selection is somewhat of a red herring: it really only helps psql indirectly, by preventing the harmful preemption. It also, somewhat paradoxically, argues for suboptimal code: for example tearing apart buddies is beneficial in the psql workload, because it also allows the more important part of the buddy to run more (the proxy).

In that sense the *real* problem isn't even parallelism (although we obviously should improve the decisions there - and the logic has suffered in the past from the psql dilemma outlined above), but whether the scheduler can (and should) identify the central proxy and keep it running as much as possible, deprioritizing fairness, wakeup buddies, runtime overlap and cache affinity considerations.

There are two broad solutions that I can see:

 - Add a kernel solution to somehow identify 'central' processes and bias them. Xorg is a similar kind of process, so it would help other workloads as well. That way lie dragons, but it might be worth an attempt or two. We already try to do a couple of robust metrics, like overlap statistics to identify buddies.

 - Let user-space occasionally identify its important (and less important) tasks - say psql could mark its worker processes as SCHED_BATCH and keep its central process(es) higher prio. A single line of obvious code in 100 KLOCs of user-space code.

Just to confirm, if you turn off all preemption via a hack (basically if you turn SCHED_OTHER into SCHED_BATCH), does psql perform and scale much better, with the quality of sibling selection and spreading of processes only being a secondary effect?

Thanks,

	Ingo
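
For illustration only, a minimal sketch of what the user-space option could look like, assuming each worker process marks itself right after it starts. The helper name mark_self_as_batch() is hypothetical and is not taken from any posted patch or from the PostgreSQL source; sched_setscheduler() and SCHED_BATCH are the standard Linux interfaces.

```c
#define _GNU_SOURCE             /* glibc exposes SCHED_BATCH only with this */
#include <sched.h>
#include <string.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3           /* Linux value, used if the headers hide it */
#endif

/*
 * Hypothetical helper: switch the calling process to SCHED_BATCH.
 * Returns 0 on success, -1 on failure (errno is set by the kernel).
 */
static int mark_self_as_batch(void)
{
	struct sched_param param;

	memset(&param, 0, sizeof(param));
	param.sched_priority = 0;   /* must be 0 for SCHED_BATCH */

	/* pid 0 means "the calling process" */
	return sched_setscheduler(0, SCHED_BATCH, &param);
}
```

A worker would call mark_self_as_batch() once during startup, while the central process stays on the default SCHED_OTHER policy. Moving from SCHED_OTHER to SCHED_BATCH normally needs no special privileges.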