Hi,
On 11.09.2018 17:19, Peter Zijlstra wrote:
> On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote:
>>> Well, explicit threading in the tool for AIO, in the simplest case, means
>>> incorporating some POSIX API implementation into the tool, avoiding
>>> code reuse in the first place. That tends to be error prone and costly.
>>
>> It's a core competency, we better do it right and not outsource it.
>>
>> Please take a look at Jiri's patches (once he re-posts them), I think it's a
>> very good starting point.
>
> There's another reason for doing custom per-cpu threads; it avoids
> bouncing the buffer memory around the machine. If the task doing the
> buffer reads is the exact same as the one doing the writes, there's less
> memory traffic on the interconnects.

Yeah, NUMA does matter. Memory locality, i.e. cache sizes and NUMA domains
for kernel and user buffer allocation, needs to be taken into account by an
effective solution. Luckily, no data losses have been observed when testing
matrix multiplication on 96-core dual-socket machines.

> Also, I think we can avoid the MFENCE in that case, but I'm not sure
> that one is hot enough to bother about on the perf reading side of
> things.

Yep, *FENCE may be costly in HW, especially at larger scale.

Thanks,
Alexey
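
P.S. To make the per-cpu reader idea concrete, here is a rough sketch of
pinning one reader thread to each monitored CPU so the thread draining a
per-cpu buffer runs on the same CPU (and NUMA node) that the kernel writes
it from. The names (cpu_buf, drain_buffer) are placeholders of mine, not
the actual perf code:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

struct cpu_buf {
	int cpu;	/* CPU this buffer belongs to */
	void *base;	/* mmap'ed ring buffer (placeholder) */
};

/* Placeholder for the actual event draining/writing logic. */
static void drain_buffer(struct cpu_buf *buf)
{
	(void)buf;
}

void *reader_thread(void *arg)
{
	struct cpu_buf *buf = arg;
	cpu_set_t set;

	/*
	 * Pin the reader to the CPU whose buffer it drains, so reads
	 * stay local to the CPU/NUMA node doing the writes.
	 */
	CPU_ZERO(&set);
	CPU_SET(buf->cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (;;)
		drain_buffer(buf);

	return NULL;
}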
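
And for the fence question, a sketch (again not the actual tools/perf code)
of the user-space side of the perf mmap ring-buffer protocol: data_head is
read with acquire semantics and data_tail is published with release
semantics, as documented in include/uapi/linux/perf_event.h. Record
wrap-around and error handling are omitted:

#include <linux/perf_event.h>
#include <stdint.h>

/*
 * Drain records in [data_tail, data_head) from one mmap'ed ring buffer.
 * Records wrapping around the end of the buffer are not handled here;
 * real code copies them out first.
 */
static void read_events(struct perf_event_mmap_page *pc,
			char *data, uint64_t data_size)
{
	/*
	 * Acquire pairs with the kernel's store to data_head; on x86
	 * this compiles to a plain load, no MFENCE/LFENCE.
	 */
	uint64_t head = __atomic_load_n(&pc->data_head, __ATOMIC_ACQUIRE);
	uint64_t tail = pc->data_tail;

	while (tail < head) {
		struct perf_event_header *hdr =
			(struct perf_event_header *)(data + (tail % data_size));

		/* ... process the record at hdr ... */
		tail += hdr->size;
	}

	/* Release tells the kernel the consumed space can be reused. */
	__atomic_store_n(&pc->data_tail, tail, __ATOMIC_RELEASE);
}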