[CCing the people who were copied in the original patch that enabled l3-cache]
On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> > Hi,
> >
> > On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> >> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> >> introduced exposing the L3 cache to the guest and enabled it by default.
> >>
> >> The motivation behind it was that in the Linux scheduler, when waking up
> >> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> >> directly, without sending a reschedule IPI.  Reduction in the IPI count
> >> led to performance gain.
> >>
> >> However, this isn't the whole story.  Once the task is on the target
> >> CPU's runqueue, it may have to preempt the current task on that CPU, be
> >> it the idle task putting the CPU to sleep or just another running task.
> >> For that a reschedule IPI will have to be issued, too.  Only when that
> >> other CPU has been running a normal task for too little time will the
> >> fairness constraints prevent the preemption and thus the IPI.
> >>
> >> This boils down to the improvement being achievable only in workloads
> >> with many actively switching tasks.  We had no access to the
> >> (proprietary?) SAP HANA benchmark the commit referred to, but the
> >> pattern is also reproduced with "perf bench sched messaging -g 1"
> >> on a 1-socket, 8-core vCPU topology, where we indeed see:
> >>
> >> l3-cache    #res IPI /s    #time / 10000 loops
> >> off         560K           1.8 sec
> >> on          40K            0.9 sec
> >>
> >> Now there's a downside: with an L3 cache the Linux scheduler is more
> >> eager to wake up tasks on sibling CPUs, resulting in unnecessary
> >> cross-vCPU interactions and therefore excessive halts and IPIs.  E.g.
> >> "perf bench sched pipe -i 100000" gives
> >>
> >> l3-cache    #res IPI /s    #HLT /s    #time / 100000 loops
> >> off         200 (no K)     230        0.2 sec
> >> on          400K           330K       0.5 sec
> >>
> >> In a more realistic test, we observe a 15% degradation in VM density
> >> (measured as the number of VMs, each running Drupal CMS serving 2 http
> >> requests per second to its main page, with 95th-percentile response
> >> latency under 100 ms) with l3-cache=on.
> >>
> >> We think that the mostly-idle scenario is more common in cloud and
> >> personal usage and should be optimized for by default; users of highly
> >> loaded VMs should be able to tune them up themselves.
> >>
> > There's one thing I don't understand in your test case: if you
> > just found out that Linux will behave worse if it assumes that
> > the VCPUs are sharing an L3 cache, why are you configuring an
> > 8-core VCPU topology explicitly?
> >
> > Do you still see a difference in the numbers if you use "-smp 8"
> > with no "cores" and "threads" options?
> >
> This is quite simple.  A lot of software licenses are bound to the number
> of CPU __sockets__.  Thus in a lot of cases it is mandatory to set a
> topology of 1 socket/xx cores to reduce the amount of money that has to
> be paid for the software.

In this case it looks like we're talking about the expected meaning of
"cores=N".  My first interpretation would be that the user obviously
wants the guest to see the multiple cores sharing an L3 cache, because
that's how real CPUs normally work.  But I see why you have different
expectations.

Numbers on dedicated-pCPU scenarios would be helpful to guide the
decision.  I wouldn't like to cause a performance regression for users
who have fine-tuned their vCPU topology and set up CPU pinning.

-- 
Eduardo
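
For reference, a minimal sketch of the configurations being compared
above (the memory size and disk image are placeholders; l3-cache is the
x86 CPU property added by commit 14c985cffa):

  # explicit 1-socket/8-core topology, virtual L3 exposed (current default)
  qemu-system-x86_64 -enable-kvm -m 4G \
      -smp 8,sockets=1,cores=8,threads=1 \
      -cpu host,l3-cache=on \
      -drive file=guest.qcow2,if=virtio

  # same explicit topology, virtual L3 hidden from the guest
  qemu-system-x86_64 -enable-kvm -m 4G \
      -smp 8,sockets=1,cores=8,threads=1 \
      -cpu host,l3-cache=off \
      -drive file=guest.qcow2,if=virtio

  # the variant asked about above: plain -smp 8, no explicit topology
  qemu-system-x86_64 -enable-kvm -m 4G \
      -smp 8 \
      -cpu host \
      -drive file=guest.qcow2,if=virtio

The workloads quoted in the commit message would then run inside the
guest:

  perf bench sched messaging -g 1     # many actively switching tasks
  perf bench sched pipe -i 100000     # mostly-idle ping-pong

Presumably the reschedule-IPI counts come from the RES line in the
guest's /proc/interrupts and the HLT counts from kvm exit tracing on
the host, but that is an assumption on my part.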