Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

Neil Horman Tue, 15 Oct 2013 06:14:54 -0700

On Tue, Oct 15, 2013 at 09:32:48AM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <[email protected]> wrote:
> 
> > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <[email protected]> wrote:
> > > 
> > > > Sébastien Dugué reported to me that devices implementing IPoIB (which
> > > > don't have checksum offload hardware) were spending a significant amount
> > > > of time computing checksums.  We found that by splitting the checksum
> > > > computation into two separate streams, each skipping successive elements
> > > > of the buffer being summed, we could parallelize the checksum operation
> > > > across multiple ALUs.  Since neither chain is dependent on the result of
> > > > the other, we get a speedup in execution (on hardware that has multiple
> > > > ALUs available, which is almost ubiquitous on x86), and only a
> > > > negligible decrease on hardware that has only a single ALU (an extra
> > > > addition is introduced).  Since addition is commutative, the result is
> > > > the same, only faster.
> > > 
> > > This patch should really come with measurement numbers: what performance 
> > > increase (and drop) did you get on what CPUs.
> > > 
> > > Thanks,
> > > 
> > >   Ingo
> > > 
> > 
> > 
> > So, early testing results today.  I wrote a test module that allocated
> > a 4k buffer, initialized it with random data, and called csum_partial on
> > it 100000 times, recording the time at the start and end of that loop.
> 
> It would be nice to stick that testcase into tools/perf/bench/, see how we 
> are able to benchmark the kernel's memcpy and memset implementations there:
> 
Sure, my module is a mess currently, but as soon as I've investigated the
ADCX/ADOX instructions that Anvin suggested, I'll see about integrating it
there.

Neil
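
For readers following along, the test described in the quoted text (a 4k
buffer of random data, summed 100000 times, with the time taken around the
loop) could look roughly like the user-space sketch below.  This is not the
test module from this thread: csum_placeholder() is a hypothetical stand-in
for csum_partial(), bench_csum() is an invented helper name, and clock() is
used here only because it is portable C.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define BUF_WORDS  1024      /* 1024 * 4 bytes = 4 KiB buffer */
#define ITERATIONS 100000    /* loop count mentioned in the thread */

typedef uint16_t (*csum_fn)(const uint32_t *buf, size_t nwords);

/* Hypothetical stand-in for the routine under test (NOT csum_partial). */
uint16_t csum_placeholder(const uint32_t *buf, size_t nwords)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < nwords; i++)
        sum += buf[i];

    /* Fold the wide accumulator down to 16 bits, end-around carry. */
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Fill a 4 KiB buffer with pseudo-random data, call fn on it ITERATIONS
 * times, and return the CPU time spent in the loop in seconds.
 * Returns -1.0 if the buffer cannot be allocated.
 */
double bench_csum(csum_fn fn, unsigned seed, uint16_t *result_out)
{
    uint32_t *buf = malloc(BUF_WORDS * sizeof(*buf));
    volatile uint16_t sink = 0;   /* keeps the result from being discarded */
    clock_t start, end;

    if (!buf)
        return -1.0;

    srand(seed);
    for (size_t i = 0; i < BUF_WORDS; i++)
        buf[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();

    start = clock();
    for (int i = 0; i < ITERATIONS; i++)
        sink = fn(buf, BUF_WORDS);
    end = clock();

    free(buf);
    if (result_out)
        *result_out = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling bench_csum(csum_placeholder, 1, &result) once per candidate routine
gives one timing per implementation.  An optimizing compiler may still
inline and hoist a pure function out of the loop, so a measurement meant to
be taken seriously needs more care than this.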


>  $ perf bench mem memcpy -r help
>  # Running 'mem/memcpy' benchmark:
>  Unknown routine:help
>  Available routines...
>         default ... Default memcpy() provided by glibc
>         x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
>         x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
>         x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S
> 
> In a similar fashion we could build the csum_partial() code as well and do 
> measurements. (We could change arch/x86/ code as well to make such 
> embedding/including easier, as long as it does not change performance.)
> 
> Thanks,
> 
>       Ingo
> 
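
As an illustration of the two-stream idea from the patch description quoted
above, here is a minimal C sketch.  It is not the patch's code: the function
names are made up for this example, and it adds 32-bit words into 64-bit
accumulators instead of using add-with-carry, which keeps the sketch short
and avoids losing carries for buffers of ordinary size.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit accumulator down to a 16-bit ones'-complement sum. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);   /* at most 33 bits */
    sum = (sum & 0xffffffffu) + (sum >> 32);   /* at most 32 bits */
    sum = (sum & 0xffffu) + (sum >> 16);       /* at most 17 bits */
    sum = (sum & 0xffffu) + (sum >> 16);       /* at most 16 bits */
    return (uint16_t)sum;
}

/* Reference: one accumulator, so each add waits on the previous one. */
uint16_t csum_single(const uint32_t *buf, size_t nwords)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < nwords; i++)
        sum += buf[i];
    return fold64(sum);
}

/*
 * Two accumulators: a takes the even-indexed words, b the odd-indexed
 * ones.  Neither chain reads the other's result inside the loop, so a
 * CPU with more than one ALU is free to execute both adds in parallel.
 */
uint16_t csum_dual(const uint32_t *buf, size_t nwords)
{
    uint64_t a = 0, b = 0;
    size_t i = 0;

    for (; i + 1 < nwords; i += 2) {
        a += buf[i];
        b += buf[i + 1];
    }
    if (i < nwords)            /* odd word count: one word left over */
        a += buf[i];

    return fold64(a + b);      /* the extra add that joins the chains */
}
```

Both functions return the same value for the same buffer, because the
partial sums are combined before folding.  Whether csum_dual() is actually
faster depends on the CPU and on what the compiler does with the loop, so
it needs measuring, as requested earlier in the thread.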
