On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
> On 2017/11/29 5:13, Eduardo Habkost wrote:
> 
> > [CCing the people who were copied in the original patch that
> > enabled l3cache]
> > 
> > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> >> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> >>> Hi,
> >>>
> >>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> >>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> >>>> introduced exposing the L3 cache to the guest and enabled it by default.
> >>>>
> >>>> The motivation behind it was that in the Linux scheduler, when waking up
> >>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> >>>> directly, without sending a reschedule IPI.  The reduction in the IPI
> >>>> count led to a performance gain.
> >>>>
> >>>> However, this isn't the whole story.  Once the task is on the target
> >>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
> >>>> it the idle task putting the CPU to sleep or just another running task.
> >>>> For that a reschedule IPI will have to be issued, too.  Only if the
> >>>> other CPU has been running a normal task for too short a time do the
> >>>> fairness constraints prevent the preemption and thus the IPI.
> >>>>
> 
> Agree. :)
> 
> Our testing VM is Suse11 guest with idle=poll at that time and now I realize
                                      ^^^^^^^^^

Oh, that's a whole lot of a difference!  I wish you had mentioned that in
that patch.
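Both points meet in the same place in the scheduler.  As a simplified sketch
of the wakeup queueing decision in kernel/sched/core.c from the kernels of
that era (not verbatim source, details vary between versions): the waker only
falls back to ttwu_queue_remote(), quoted below, when the two CPUs do not
report a shared last-level cache, which is exactly what the virtual L3
changes for sibling vCPUs.

'''
/*
 * Simplified sketch (not verbatim kernel source) of the wakeup queueing
 * decision in kernel/sched/core.c around the 4.x kernels discussed here.
 * When the waking CPU and the target CPU share a last-level cache, which
 * is what the guest concludes once QEMU exposes a virtual L3, the task is
 * activated on the target runqueue directly; otherwise the wakeup goes
 * through ttwu_queue_remote(), which may raise a RES IPI.
 */
static void ttwu_queue(struct task_struct *p, int cpu)
{
        struct rq *rq = cpu_rq(cpu);

#if defined(CONFIG_SMP)
        if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
                /* No shared LLC: hand the wakeup over to the remote CPU. */
                sched_clock_cpu(cpu);   /* sync clocks across CPUs */
                ttwu_queue_remote(p, cpu);
                return;
        }
#endif

        /* Shared LLC: activate the task on the target runqueue directly. */
        raw_spin_lock(&rq->lock);
        ttwu_do_activate(rq, p, 0);
        raw_spin_unlock(&rq->lock);
}
'''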
> that Suse11 has a BUG in its scheduler.
> 
> For RHEL 7.3 or an upstream kernel, in ttwu_queue_remote(), a RES IPI is
> issued only if rq->idle is not polling:
> '''
> static void ttwu_queue_remote(struct task_struct *p, int cpu)
> {
>         struct rq *rq = cpu_rq(cpu);
> 
>         if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>                 if (!set_nr_if_polling(rq->idle))
>                         smp_send_reschedule(cpu);
>                 else
>                         trace_sched_wake_idle_without_ipi(cpu);
>         }
> }
> '''
> 
> But Suse11 does not perform this check; it sends a RES IPI unconditionally.
> 
> >>>> This boils down to the improvement being only achievable in workloads
> >>>> with many actively switching tasks.  We had no access to the
> >>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
> >>>> pattern is also reproduced with "perf bench sched messaging -g 1"
> >>>> on a 1-socket, 8-core vCPU topology, where we see indeed:
> >>>>
> >>>> l3-cache    #res IPI /s    #time / 10000 loops
> >>>> off         560K           1.8 sec
> >>>> on          40K            0.9 sec
> >>>>
> >>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
> >>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> >>>> interactions and therefore excessive halts and IPIs.  E.g. "perf bench
> >>>> sched pipe -i 100000" gives
> >>>>
> >>>> l3-cache    #res IPI /s    #HLT /s    #time / 100000 loops
> >>>> off         200 (no K)     230        0.2 sec
> >>>> on          400K           330K       0.5 sec
> >>>>
> 
> I guess this issue could be resolved by disabling SD_WAKE_AFFINE.

But that requires extra tuning in the guest, which is even less likely to
happen in the cloud case where VM admin != host admin.

> As Gonglei said:
> 1. the L3 cache relates to the user experience.
> 2. glibc gets the cache info by CPUID directly, which relates to memory
>    performance.
> 
> What's more, the L3 cache relates to the sched_domain, which is important
> to the (load) balancer when the system is busy.
> 
> All this doesn't mean the patch is insignificant; I just think we should do
> more research before deciding.  I'll do some tests, thanks. :)

Looking forward to it, thanks!

Roman.
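On Gonglei's second point, what glibc (and through it the applications) sees
can be checked from inside the guest by walking the CPUID deterministic cache
parameters leaf, the same source glibc's x86 cache-info code consults.  A
minimal sketch, assuming an Intel-style guest vCPU and GCC's <cpuid.h>;
AMD-style guests use leaf 0x8000001D instead, which this does not handle:

'''
/*
 * Hedged sketch: enumerate the cache hierarchy via CPUID leaf 4
 * ("deterministic cache parameters"), field encodings per the Intel SDM.
 * Whether an L3 entry shows up inside the VM at all is what the QEMU
 * "l3-cache" CPU property controls.
 */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        for (unsigned int subleaf = 0; ; subleaf++) {
                if (!__get_cpuid_count(4, subleaf, &eax, &ebx, &ecx, &edx))
                        break;

                unsigned int type = eax & 0x1f;         /* 0 = no more caches */
                if (type == 0)
                        break;

                unsigned int level     = (eax >> 5) & 0x7;
                unsigned int ways      = ((ebx >> 22) & 0x3ff) + 1;
                unsigned int parts     = ((ebx >> 12) & 0x3ff) + 1;
                unsigned int line_size = (ebx & 0xfff) + 1;
                unsigned int sets      = ecx + 1;
                unsigned long size     = (unsigned long)ways * parts * line_size * sets;

                printf("L%u %s cache: %lu KiB\n", level,
                       type == 1 ? "data" : type == 2 ? "insn" : "unified",
                       size / 1024);
        }
        return 0;
}
'''

With l3-cache=on the loop should additionally report a unified level-3 entry;
with l3-cache=off it should stop after L2, and that is the picture the guest's
glibc and scheduler then act upon.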