Hi,
On 11.09.2018 17:19, Peter Zijlstra wrote:
> On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote:
>>> Well, explicit threading in the tool for AIO, in the simplest case, means
>>> incorporating some POSIX API implementation into the tool, avoiding
>>> code reuse in the first place. That tends to be error prone and costly.
>>
>> It's a core competency, we better do it right and not outsource it.
>>
>> Please take a look at Jiri's patches (once he re-posts them), I think it's a
>> very good starting point.
>
> There's another reason for doing custom per-cpu threads; it avoids
> bouncing the buffer memory around the machine. If the task doing the
> buffer reads is the exact same as the one doing the writes, there's less
> memory traffic on the interconnects.

Yeah, NUMA does matter. Memory locality, i.e. cache sizes and NUMA domains
for kernel and user buffer allocation, needs to be taken into account by an
effective solution. Luckily, no data losses have been observed when testing
matrix multiplication on 96-core dual-socket machines.

> Also, I think we can avoid the MFENCE in that case, but I'm not sure
> that one is hot enough to bother about on the perf reading side of
> things.

Yep, *FENCE may be costly in HW, especially at larger scale.

Thanks,
Alexey
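
P.S. To make the per-cpu reader idea concrete, here is a rough sketch of
pinning one reader thread to each monitored CPU so the thread draining a
per-cpu buffer runs on the same CPU (and NUMA node) that the kernel writes
it from. The names (cpu_buf, drain_buffer) are placeholders of mine, not
the actual perf code:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

struct cpu_buf {
	int cpu;	/* CPU this buffer belongs to */
	void *base;	/* mmap'ed ring buffer (placeholder) */
};

/* Placeholder for the actual event draining/writing logic. */
static void drain_buffer(struct cpu_buf *buf)
{
	(void)buf;
}

void *reader_thread(void *arg)
{
	struct cpu_buf *buf = arg;
	cpu_set_t set;

	/*
	 * Pin the reader to the CPU whose buffer it drains, so reads
	 * stay local to the CPU/NUMA node doing the writes.
	 */
	CPU_ZERO(&set);
	CPU_SET(buf->cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (;;)
		drain_buffer(buf);

	return NULL;
}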
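
And for the fence question, a sketch (again not the actual tools/perf code)
of the user-space side of the perf mmap ring-buffer protocol: data_head is
read with acquire semantics and data_tail is published with release
semantics, as documented in include/uapi/linux/perf_event.h. Record
wrap-around and error handling are omitted:

#include <linux/perf_event.h>
#include <stdint.h>

/*
 * Drain records in [data_tail, data_head) from one mmap'ed ring buffer.
 * Records wrapping around the end of the buffer are not handled here;
 * real code copies them out first.
 */
static void read_events(struct perf_event_mmap_page *pc,
			char *data, uint64_t data_size)
{
	/*
	 * Acquire pairs with the kernel's store to data_head; on x86
	 * this compiles to a plain load, no MFENCE/LFENCE.
	 */
	uint64_t head = __atomic_load_n(&pc->data_head, __ATOMIC_ACQUIRE);
	uint64_t tail = pc->data_tail;

	while (tail < head) {
		struct perf_event_header *hdr =
			(struct perf_event_header *)(data + (tail % data_size));

		/* ... process the record at hdr ... */
		tail += hdr->size;
	}

	/* Release tells the kernel the consumed space can be reused. */
	__atomic_store_n(&pc->data_tail, tail, __ATOMIC_RELEASE);
}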