RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-04 Thread David Laight
> > I think you need 3 instructions, move a 0, conditionally move a 1 > > then add. I suspect it won't be a win! Or, with an appropriately unrolled loop, for each word: zero %eax, cmove a 1 to %al cmove a 1 to %ah shift %eax left, cmove a 1 to %al cmove a 1 to %ah,

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Neil Horman
On Fri, Nov 01, 2013 at 01:26:52PM -0700, Joe Perches wrote: > On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote: > > On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote: > > > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote: > > > > > > > I think it would be better if we just did

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Joe Perches
On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote: > On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote: > > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote: > > > > > I think it would be better if we just did the prefetch here > > > and re-addressed this area when AVX (or addcx/a

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Neil Horman
On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote: > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote: > > > I think it would be better if we just did the prefetch here > > and re-addressed this area when AVX (or addcx/addox) instructions were > > available > > for testing on hardwa

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Joe Perches
On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote: > I think it would be better if we just did the prefetch here > and re-addressed this area when AVX (or addcx/addox) instructions were > available > for testing on hardware. Could there be a difference if only a single software prefetch was d

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Neil Horman
On Fri, Nov 01, 2013 at 04:18:50PM -, David Laight wrote: > > How would you suggest replacing the jumps in this case? I agree it would be > > faster here, but I'm not sure how I would implement an increment using a > > single > > conditional move. > > I think you need 3 instructions, move a

RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread David Laight
> How would you suggest replacing the jumps in this case? I agree it would be > faster here, but I'm not sure how I would implement an increment using a > single > conditional move. I think you need 3 instructions, move a 0, conditionally move a 1 then add. I suspect it won't be a win! If you d
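
The "move a 0, conditionally move a 1, then add" idea in the snippet above can be sketched in C as a branch-free end-around carry. This is only an illustration under stated assumptions: `add_carry_branchless` is a hypothetical helper name, not code from the patch, and whether a compiler turns the comparison into `cmov`, `setc`, or `adc` is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): add x into sum and fold the
 * carry-out back in without a conditional branch.
 */
uint64_t add_carry_branchless(uint64_t sum, uint64_t x)
{
    uint64_t carry = 0;          /* "move a 0" */

    sum += x;                    /* unsigned add wraps on overflow */
    carry = (sum < x) ? 1 : 0;   /* "conditionally move a 1" if it wrapped */
    return sum + carry;          /* "then add"; cannot wrap a second time */
}
```

The second addition is safe because a wrapped sum is at most 2^64 - 2, so adding the carry never overflows again.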

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Ben Hutchings
On Fri, 2013-11-01 at 12:08 -0400, Neil Horman wrote: > On Fri, Nov 01, 2013 at 03:42:46PM +, Ben Hutchings wrote: > > On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote: > > [...] > > > It > > > functions, but unfortunately the performance lost to the completely broken > > > branch predictio

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Neil Horman
On Fri, Nov 01, 2013 at 03:42:46PM +, Ben Hutchings wrote: > On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote: > [...] > > It > > functions, but unfortunately the performance lost to the completely broken > > branch prediction that this inflicts makes it a non starter: > [...] > > Conditio

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Ben Hutchings
On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote: [...] > It > functions, but unfortunately the performance lost to the completely broken > branch prediction that this inflicts makes it a non starter: [...] Conditional branches are no good but conditional moves might be worth a shot. Ben.

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Neil Horman
On Fri, Nov 01, 2013 at 10:13:37AM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote: > > > > > > * Neil Horman wrote: > > > > > > > > etc. For such short runtimes make sure the last column displays > > > > > close to 100%, s

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Ingo Molnar
* Neil Horman wrote: > Prefetch and simulated adcx/adox from above: > Performance counter stats for './test.sh' (20 runs): > > 35,704,331 L1-dcache-load-misses >( +- 0.07% ) [75.00%] > 0 L1-dcache-prefetches

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Ingo Molnar
* Neil Horman wrote: > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote: > > > > * Neil Horman wrote: > > > > > > etc. For such short runtimes make sure the last column displays > > > > close to 100%, so that the PMU results become trustable. > > > > > > > > A nehalem+ PMU will a

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-31 Thread Neil Horman
On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote: > On 10/30/2013 07:02 AM, Neil Horman wrote: > > >That does makes sense, but it then begs the question, whats the advantage of > >having multiple alu's at all? > > There's lots of ALU operations that don't operate on the flags or > oth

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-31 Thread Neil Horman
On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > > etc. For such short runtimes make sure the last column displays > > > close to 100%, so that the PMU results become trustable. > > > > > > A nehalem+ PMU will allow 2-4 events to be measured in parall

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-31 Thread Ingo Molnar
* Neil Horman wrote: > > etc. For such short runtimes make sure the last column displays > > close to 100%, so that the PMU results become trustable. > > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, > > plus generics like 'cycles', 'instructions' can be added 'for free

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread Neil Horman
On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote: > On 10/30/2013 07:02 AM, Neil Horman wrote: > > >That does makes sense, but it then begs the question, whats the advantage of > >having multiple alu's at all? > > There's lots of ALU operations that don't operate on the flags or > oth

RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread David Laight
... > and then I also wanted to try using both xmm and ymm registers and doing > 64bit adds with 32bit numbers across multiple xmm/ymm registers as that > should parallel nicely. David, you mentioned you've tried this, how did > your experiment turn out and what was your method? I was planning on

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread Doug Ledford
On 10/30/2013 07:02 AM, Neil Horman wrote: That does makes sense, but it then begs the question, whats the advantage of having multiple alu's at all? There's lots of ALU operations that don't operate on the flags or other entities that can be run in parallel. If they're just going to seria

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread Doug Ledford
On 10/30/2013 08:18 AM, David Laight wrote: /me wonders if rearranging the instructions into this order: adcq 0*8(src), res1 adcq 1*8(src), res2 adcq 2*8(src), res1 Those have to be sequenced. Using a 64bit lea to add 32bit quantities should avoid the dependencies on the flags register. Howeve

RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread David Laight
> /me wonders if rearranging the instructions into this order: > adcq 0*8(src), res1 > adcq 1*8(src), res2 > adcq 2*8(src), res1 Those have to be sequenced. Using a 64bit lea to add 32bit quantities should avoid the dependencies on the flags register. However you'd need to get 3 of those active t
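
The point in the snippet above, that adding 32-bit quantities with 64-bit arithmetic avoids any dependency on the carry flag, can be sketched in portable C. This is a minimal sketch, not the patch's code: `csum_lanes`, `csum_ref16`, and `fold64` are hypothetical names, the buffer length is assumed to be a multiple of 8 bytes, and the buffer is assumed small enough (well under 2^31 words per lane) that the 64-bit accumulators cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit accumulator down to a 16-bit ones'-complement sum. */
static uint16_t fold64(uint64_t acc)
{
    acc = (acc & 0xffffffffu) + (acc >> 32);  /* at most 33 bits */
    acc = (acc & 0xffffffffu) + (acc >> 32);  /* now fits in 32 bits */
    acc = (acc & 0xffff) + (acc >> 16);       /* at most 17 bits */
    acc = (acc & 0xffff) + (acc >> 16);       /* now fits in 16 bits */
    return (uint16_t)acc;
}

/*
 * Two independent accumulators, each summing 32-bit words in 64-bit
 * arithmetic. Carries land in the upper 32 bits of the accumulator,
 * so no add-with-carry chain is needed inside the loop.
 */
uint16_t csum_lanes(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t acc0 = 0, acc1 = 0;
    size_t i;

    assert(len % 8 == 0);  /* assumption of this sketch */
    for (i = 0; i < len; i += 8) {
        uint32_t w0, w1;

        memcpy(&w0, p + i, sizeof(w0));
        memcpy(&w1, p + i + 4, sizeof(w1));
        acc0 += w0;
        acc1 += w1;
    }
    return fold64(acc0 + acc1);
}

/* Reference: 16-bit words with the end-around carry folded every step. */
uint16_t csum_ref16(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t sum = 0;
    size_t i;

    assert(len % 2 == 0);
    for (i = 0; i < len; i += 2) {
        uint16_t w;

        memcpy(&w, p + i, sizeof(w));
        sum += w;
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t)sum;
}
```

Both functions read words in host byte order and return the unfolded-complement sum (no final inversion), so they should agree on the same buffer. Whether the two-lane form is actually faster is exactly what the thread is debating.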

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread Neil Horman
On Wed, Oct 30, 2013 at 01:25:39AM -0400, Doug Ledford wrote: > * Neil Horman wrote: > > 3) The run times are proportionally larger, but still indicate that > > Parallel ALU > > execution is hurting rather than helping, which is counter-intuitive. I'm > > looking into it, but thought you might w

RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread David Laight
> The parallel ALU design of this patch seems OK at first glance, but it means > that two parallel operations are both trying to set/clear both the overflow > and carry flags of the EFLAGS register of the *CPU* (not the ALU). So, either > some CPU in the past had a set of overflow/carry flags per

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Doug Ledford
* Neil Horman wrote: > 3) The run times are proportionally larger, but still indicate that Parallel > ALU > execution is hurting rather than helping, which is counter-intuitive. I'm > looking into it, but thought you might want to see these results in case > something jumped out at you So here'

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 03:27:16PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > So, I apologize, you were right. I was running the test.sh script > > but perf was measuring itself. [...] > > Ok, cool - one mystery less! > > > Which overall looks alot more like I expect, save for

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Neil Horman wrote: > So, I apologize, you were right. I was running the test.sh script > but perf was measuring itself. [...] Ok, cool - one mystery less! > Which overall looks alot more like I expect, save for the parallel > ALU cases. It seems here that the parallel ALU changes actually

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > I'm sure it worked properly on my system here, I specificially > > checked it, but I'll gladly run it again. You have to give me an > > hour as I have a meeting to run to, but I'll have results shortly

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread David Ahern
On 10/29/13 6:52 AM, Ingo Molnar wrote: According to the perf man page, I'm supposed to be able to use -- to separate perf command line parameters from the command I want to run. And it definitely executed test.sh, I added an echo to stdout in there as a test run and observed them get captured i

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > I'm sure it worked properly on my system here, I specificially > > checked it, but I'll gladly run it again. You have to give me an > > hour as I have a meeting to run to, but I'll have results shortly

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Neil Horman wrote: > I'm sure it worked properly on my system here, I specificially > checked it, but I'll gladly run it again. You have to give me an > hour as I have a meeting to run to, but I'll have results shortly. So what I tried to react to was this observation of yours: > > > Here

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 01:52:33PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote: > > > > > > * Neil Horman wrote: > > > > > > > Sure it was this: > > > > for i in `seq 0 1 3` > > > > do > > > > echo $i > /sys/module/csum_

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Neil Horman wrote: > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote: > > > > * Neil Horman wrote: > > > > > Sure it was this: > > > for i in `seq 0 1 3` > > > do > > > echo $i > /sys/module/csum_test/parameters/module_test_mode > > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > Sure it was this: > > for i in `seq 0 1 3` > > do > > echo $i > /sys/module/csum_test/parameters/module_test_mode > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- > > /root/

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Neil Horman wrote: > Sure it was this: > for i in `seq 0 1 3` > do > echo $i > /sys/module/csum_test/parameters/module_test_mode > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- > /root/test.sh > done >> counters.txt 2>&1 > > where test.sh is: > #!/bin/sh > echo 1

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 09:25:42AM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > Heres my data for running the same test with taskset restricting > > execution to only cpu0. I'm not quite sure whats going on here, > > but doing so resulted in a 10x slowdown of the runtime of each

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Doug Ledford wrote: > [ Snipped a couple of really nice real-life bandwidth tests. ] > Some of my preliminary results: > > 1) Regarding the initial claim that changing the code to have two > addition chains, allowing the use of two ALUs, doubling > performance: I'm just not seeing it. I h

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Neil Horman wrote: > Heres my data for running the same test with taskset restricting > execution to only cpu0. I'm not quite sure whats going on here, > but doing so resulted in a 10x slowdown of the runtime of each > iteration which I can't explain. As before however, both the > parall

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
On Mon, Oct 28, 2013 at 01:46:30PM -0400, Neil Horman wrote: > On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote: > > > > * Neil Horman wrote: > > > > > Looking at the specific cpu counters we get this: > > > > > > Base: > > > Total time: 0.179 [sec] > > > > > > Performance cou

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
On Mon, Oct 28, 2013 at 05:20:45PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > Base: > >0.093269042 seconds time elapsed > > ( +- 2.24% ) > > Prefetch (5x64): > >0.079440009 seconds time elapsed

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > Looking at the specific cpu counters we get this: > > > > Base: > > Total time: 0.179 [sec] > > > > Performance counter stats for 'perf bench sched messaging -- bash -c echo > > 1 > /sys/module/c

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Doug Ledford
On 10/26/2013 07:55 AM, Ingo Molnar wrote: > > * Doug Ledford wrote: > >>> What I was objecting to strongly here was to measure the _wrong_ >>> thing, i.e. the cache-hot case. The cache-cold case should be >>> measured in a low noise fashion, so that results are >>> representative. It's closer to

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread David Ahern
On 10/28/13 10:24 AM, Ingo Molnar wrote: The most accurate method of measurement for such single-threaded workloads is something like: taskset 0x1 perf stat -a -C 1 --repeat 20 ... this will bind your workload to CPU#0, and will do PMU measurements only there - without mixing in other

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Ingo Molnar
* Neil Horman wrote: > Looking at the specific cpu counters we get this: > > Base: > Total time: 0.179 [sec] > > Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > > /sys/module/csum_test/parameters/test_fire' (20 runs): > >1571.304618 task-clock

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Ingo Molnar
* Neil Horman wrote: > Base: >0.093269042 seconds time elapsed >( +- 2.24% ) > Prefetch (5x64): >0.079440009 seconds time elapsed >( +- 2.29% ) > Parallel ALU: >0.08777 seconds tim

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
Ingo, et al.- Ok, sorry for the delay, here are the test results you've been asking for. First, some information about what I did. I attached the module that I ran this test with at the bottom of this email. You'll note that I started using a module parameter write patch to trigger th

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-27 Thread Neil Horman
On Sun, Oct 27, 2013 at 08:26:32AM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > > You keep ignoring my request to calculate and account for noise of the > > > measurement. > > > > Don't confuse "ignoring" with "haven't gotten there yet". [...] > > So, instead of replying to my re

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-27 Thread Ingo Molnar
* Neil Horman wrote: > > You keep ignoring my request to calculate and account for noise of the > > measurement. > > Don't confuse "ignoring" with "haven't gotten there yet". [...] So, instead of replying to my repeated feedback with a single line mail that you plan to address it, you repea

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-26 Thread Neil Horman
On Sat, Oct 26, 2013 at 02:01:08PM +0200, Ingo Molnar wrote: > > * Neil Horman wrote: > > > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote: > > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: > > > > > > > > > > > Ok, so I ran the above code on a single cpu using taskset,

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-26 Thread Ingo Molnar
* Neil Horman wrote: > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote: > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: > > > > > > > > Ok, so I ran the above code on a single cpu using taskset, and set irq > > > affinity > > > such that no interrupts (save for local on

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-26 Thread Ingo Molnar
* Doug Ledford wrote: > > What I was objecting to strongly here was to measure the _wrong_ > > thing, i.e. the cache-hot case. The cache-cold case should be > > measured in a low noise fashion, so that results are > > representative. It's closer to the real usecase than any other > > microbe

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-25 Thread Neil Horman
On Fri, Oct 18, 2013 at 10:09:54AM -0700, H. Peter Anvin wrote: > If implemented properly adcx/adox should give additional speedup... that is > the whole reason for their existence. > Ok, fair enough. Unfortunately, I'm not going to be able to get my hands on a stepping of this CPU to test any co

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Neil Horman
On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote: > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: > > > > > Ok, so I ran the above code on a single cpu using taskset, and set irq > > affinity > > such that no interrupts (save for local ones), would occur on that cpu. > > No

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Eric Dumazet
On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: > > Ok, so I ran the above code on a single cpu using taskset, and set irq > affinity > such that no interrupts (save for local ones), would occur on that cpu. Note > that I had to convert csum_partial_opt to csum_partial, as the _opt varian

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Neil Horman
On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: > > > #define BUFSIZ_ORDER 4 > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) > > static int __init csum_init_module(void) > > { > > int i; > > __wsum sum = 0; > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Doug Ledford
On 10/19/2013 04:23 AM, Ingo Molnar wrote: > > * Doug Ledford wrote: >> All prefetch operations get sent to an access queue in the memory >> controller where they compete with both other reads and writes for the >> available memory bandwidth. The optimal prefetch window is not a factor >> of

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Neil Horman
On Mon, Oct 21, 2013 at 10:31:38AM -0700, Eric Dumazet wrote: > On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote: > > On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: > > > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: > > > > > > > #define BUFSIZ_ORDER 4 > > > > #define B

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Eric Dumazet
On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote: > On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: > > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: > > > > > #define BUFSIZ_ORDER 4 > > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) > > > static int __init csum_i

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-20 Thread Neil Horman
On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: > > > #define BUFSIZ_ORDER 4 > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) > > static int __init csum_init_module(void) > > { > > int i; > > __wsum sum = 0; > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-19 Thread Ingo Molnar
* Doug Ledford wrote: > >> Based on these, prefetching is obviously a a good improvement, but > >> not as good as parallel execution, and the winner by far is doing > >> both. > > OK, this is where I have to chime in that these tests can *not* be used > to say anything about prefetch, and no

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Doug Ledford
On Mon, 2013-10-14 at 22:49 -0700, Joe Perches wrote: > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: >> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: >> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: >> > > attached patch brings much better results >> > > >> > > lpq83:~

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Doug Ledford
On 2013-10-17, Ingo wrote: > * Neil Horman wrote: > >> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote: >> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: >> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: >> > > >> > > > So, early testing results today. I wrote

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Eric Dumazet
On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: > #define BUFSIZ_ORDER 4 > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) > static int __init csum_init_module(void) > { > int i; > __wsum sum = 0; > struct timespec start, end; > u64 time; > struct page *pag

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Neil Horman
On Fri, Oct 18, 2013 at 10:20:35AM -0700, Eric Dumazet wrote: > On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote: > > > > > > for(i=0;i<10;i++) { > > sum = csum_partial(buf+offset, PAGE_SIZE, sum); > > offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE :

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Eric Dumazet
On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote: > > > for(i=0;i<10;i++) { > sum = csum_partial(buf+offset, PAGE_SIZE, sum); > offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE : 0; > } Please replace this by random accesses, and use the mo
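
The request above to replace the sequential walk with random accesses could look roughly like the following userspace sketch. Everything here is an assumption for illustration: `random_page_offset` and `xorshift32` are hypothetical helpers, the buffer size is assumed to be a whole number of pages, and a kernel test module would more likely use the kernel's own random-number helpers than this small generator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift PRNG; the state must be nonzero and stays nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/*
 * Pick a pseudo-random, page-aligned offset inside a buffer of bufsiz
 * bytes, so successive checksum calls do not touch pages in order.
 */
size_t random_page_offset(uint32_t *state, size_t bufsiz, size_t page_size)
{
    size_t npages;

    assert(page_size != 0 && bufsiz >= page_size && *state != 0);
    npages = bufsiz / page_size;
    return (size_t)(xorshift32(state) % npages) * page_size;
}
```

In the quoted loop, the `offset = ...` line would then become something like `offset = random_page_offset(&seed, BUFSIZ, PAGE_SIZE);`, with `seed` being a hypothetical nonzero `uint32_t` initialised before the loop.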

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread H. Peter Anvin
If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence. Neil Horman wrote: >On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote: >> On 10/11/2013 09:51 AM, Neil Horman wrote: >> > Sébastien Dugué reported to me that devices imp

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Neil Horman
> > Your benchmark uses a single 4K page, so data is _super_ hot in cpu > caches. > ( prefetch should give no speedups, I am surprised it makes any > difference) > > Try now with 32 huges pages, to get 64 MBytes of working set. > > Because in reality we never csum_partial() data in cpu cache. >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Neil Horman
On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote: > On 10/11/2013 09:51 AM, Neil Horman wrote: > > Sébastien Dugué reported to me that devices implementing ipoib (which don't > > have > > checksum offload hardware were spending a significant amount of time > > computing > > checksum

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-17 Thread Ingo Molnar
* H. Peter Anvin wrote: > On 10/17/2013 01:41 AM, Ingo Molnar wrote: > > > > To correctly simulate the workload you'd have to: > > > > - allocate a buffer larger than your L2 cache. > > > > - to measure the effects of the prefetches you'd also have to randomize > >the individual buffer

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-17 Thread Eric Dumazet
On Thu, 2013-10-17 at 11:19 -0700, H. Peter Anvin wrote: > Seriously, though, how much does it matter? All the above seems likely > to do is to drown the signal by adding noise. I don't think so. > > If the parallel (threaded) checksumming is faster, which theory says it > should and microbenc

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-17 Thread H. Peter Anvin
On 10/17/2013 01:41 AM, Ingo Molnar wrote: > > To correctly simulate the workload you'd have to: > > - allocate a buffer larger than your L2 cache. > > - to measure the effects of the prefetches you'd also have to randomize >the individual buffer positions. See how 'perf bench numa' implem

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-17 Thread Ingo Molnar
* Neil Horman wrote: > On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote: > > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: > > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: > > > > > > > So, early testing results today. I wrote a test module that, allocated > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Eric Dumazet
On Wed, 2013-10-16 at 20:34 -0400, Neil Horman wrote: > > > > So I went to reproduce these results, but was unable to (due to the fact that > I > only have a pretty jittery network to do testing accross at the moment with > these devices). So instead I figured that I would go back to just doin

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Neil Horman
On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote: > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: > > > > > So, early testing results today. I wrote a test module that, allocated a > > > 4k > > > buffer, initalized it

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Joe Perches
On Wed, 2013-10-16 at 08:25 +0200, Ingo Molnar wrote: > Prefetch takes memory from L2->L1 memory > just as much as it moves it cachelines from memory to the L2 cache. Yup, mea culpa. I thought the prefetch was still to L1 like the Pentium.

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Ingo Molnar
* Joe Perches wrote: > On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote: > > * Joe Perches wrote: > > > > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: > > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: > > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 09:21 -0700, Joe Perches wrote: > Ingo, Eric _showed_ that the prefetch is good here. > How about looking at a little optimization to the minimal > prefetch that gives that level of performance. Wait a minute, my point was to remind that main cost is the memory fetching. It

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 18:02 +0200, Andi Kleen wrote: > > I get the csum_partial() if disabling prequeue. > > At least in the ipoib case i would consider that a misconfiguration. There is nothing you can do, if application is not blocked on recv(), but using poll()/epoll()/select(), prequeue is no

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Joe Perches
On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote: > * Joe Perches wrote: > > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: > > > > > attached patch brings much

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Andi Kleen
> I get the csum_partial() if disabling prequeue. At least in the ipoib case i would consider that a misconfiguration. "don't do this if it hurts" There may be more such problems. -Andi

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 07:26 -0700, Eric Dumazet wrote: > And the receiver should also do the same : (ethtool -K eth0 rx off) > > 10.55%netserver [kernel.kallsyms] [k] > csum_partial_copy_generic I get the csum_partial() if disabling prequeue. echo 1 >/proc/sys/net/ipv4/tcp

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 16:15 +0200, Sébastien Dugué wrote: > Hi Eric, > > On Tue, 15 Oct 2013 07:06:25 -0700 > Eric Dumazet wrote: > > But the csum cost is both for sender and receiver ? > > No, it was only on the receiver side that I noticed it. > Yes, as Andi said, we do the csum while cop

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Sébastien Dugué
Hi Eric, On Tue, 15 Oct 2013 07:06:25 -0700 Eric Dumazet wrote: > On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote: > > On Tue, 15 Oct 2013 15:33:36 +0200 > > Andi Kleen wrote: > > > > > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR > > > > hardware > > > > wher

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote: > On Tue, 15 Oct 2013 15:33:36 +0200 > Andi Kleen wrote: > > > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR > > > hardware > > > where one cannot benefit from hardware offloads. > > > > Is this with sendfile? > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Sébastien Dugué
On Tue, 15 Oct 2013 15:33:36 +0200 Andi Kleen wrote: > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware > > where one cannot benefit from hardware offloads. > > Is this with sendfile? Tests were done with iperf at the time without any extra funky options, and look

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Andi Kleen
> indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware > where one cannot benefit from hardware offloads. Is this with sendfile? For normal send() the checksum is done in the user copy and for receiving it can be also done during the copy in most cases -Andi
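
Andi's point above is that the checksum can be folded into the copy, so the data is only read once. A minimal userspace sketch of that idea follows. It is not the kernel's csum_partial_copy_generic; the names `copy_and_csum` and `csum_fold64` are hypothetical, and the sum is taken over 16-bit words in host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a wide ones'-complement sum down to 16 bits (end-around carry). */
static uint16_t csum_fold64(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Hypothetical helper: copy len bytes from src to dst and return the
 * folded 16-bit ones'-complement sum of the copied data (not inverted).
 * Each 16-bit word is loaded once and used for both the store and the sum.
 */
static uint16_t copy_and_csum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 2 <= len; i += 2) {
        uint16_t w;

        memcpy(&w, s + i, sizeof(w));   /* load, host byte order */
        memcpy(d + i, &w, sizeof(w));   /* store the same word */
        sum += w;
    }
    if (i < len) {                      /* trailing odd byte, zero padded */
        uint16_t w = 0;

        d[i] = s[i];
        memcpy(&w, s + i, 1);
        sum += w;
    }
    return csum_fold64(sum);
}
```

Whether this beats a separate copy plus checksum depends on the hardware and the data size, so it would need measuring rather than assuming.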

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Neil Horman
On Mon, Oct 14, 2013 at 02:07:48PM -0700, Eric Dumazet wrote: > On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote: > > * Andi Kleen wrote: > > > > > Neil Horman writes: > > > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > > don't have checksum offload ha

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Neil Horman
On Tue, Oct 15, 2013 at 09:32:48AM +0200, Ingo Molnar wrote: > > * Neil Horman wrote: > > > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote: > > > > > > * Neil Horman wrote: > > > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > > don't have che

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Ingo Molnar
* Borislav Petkov wrote: > On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote: > > Most processors have hundreds of cachelines even in their L1 cache. > > Thousands in the L2 cache, up to hundreds of thousands. > > Also, I have this hazy memory of prefetch hints being harmful in some >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Borislav Petkov
On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote: > Most processors have hundreds of cachelines even in their L1 cache. > Thousands in the L2 cache, up to hundreds of thousands. Also, I have this hazy memory of prefetch hints being harmful in some situations: https://lwn.net/Articles/44

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Sébastien Dugué
Hi Neil, Andi, On Mon, 14 Oct 2013 16:25:28 -0400 Neil Horman wrote: > On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote: > > Neil Horman writes: > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > don't have > > > checksum offload hardware) were s

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Ingo Molnar
* Joe Perches wrote: > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: > > > > attached patch brings much better results > > > > > > > > lpq83:~# ./netperf -H 7.7.8.84 -

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Ingo Molnar
* Neil Horman wrote: > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote: > > > > * Neil Horman wrote: > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > don't have checksum offload hardware) were spending a significant amount > > > of time comput

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Joe Perches
On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: > > > attached patch brings much better results > > > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc > > > MIGRATED TCP STREAM T

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: > > attached patch brings much better results > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 > > () port

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Joe Perches
On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: > attached patch brings much better results > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () > port 0 AF_INET > Recv SendSend Utilizati

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: > > > So, early testing results today. I wrote a test module that allocated a 4k > > buffer, initialized it with random data, and called csum_partial on it 10 > > times, recording th

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: > So, early testing results today. I wrote a test module that allocated a 4k > buffer, initialized it with random data, and called csum_partial on it 10 > times, recording the time at the start and end of that loop. Results on a 2.4 > GHz
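
The test described above (allocate a buffer, fill it with random data, call the checksum routine in a loop, record the time at the start and end) can be approximated in userspace. This is only a sketch under stated assumptions: the original was a kernel module calling csum_partial, whereas `csum_ref` and `time_csum` here are hypothetical stand-ins, and timing uses the standard C `clock()` rather than whatever the module used.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef uint16_t (*csum_fn)(const void *buf, size_t len);

/* Reference 16-bit ones'-complement sum, host byte order, not inverted. */
static uint16_t csum_ref(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 2 <= len; i += 2) {
        uint16_t w;

        memcpy(&w, p + i, sizeof(w));
        sum += w;
    }
    if (i < len) {                      /* trailing odd byte, zero padded */
        uint16_t w = 0;

        memcpy(&w, p + i, 1);
        sum += w;
    }
    while (sum >> 16)                   /* end-around carry */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Hypothetical harness: checksum a random len-byte buffer iters times
 * and return the elapsed CPU time in seconds, or -1.0 if allocation fails.
 * The last checksum is written to *out so the calls are not optimized away.
 */
static double time_csum(csum_fn fn, size_t len, unsigned long iters,
                        uint16_t *out)
{
    unsigned char *buf = malloc(len);
    volatile uint16_t sink = 0;
    clock_t start, end;
    unsigned long n;
    size_t i;

    if (!buf)
        return -1.0;
    srand(1);                           /* fixed seed for repeatable data */
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)rand();

    start = clock();
    for (n = 0; n < iters; n++)
        sink = fn(buf, len);
    end = clock();

    if (out)
        *out = sink;
    free(buf);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as `time_csum(csum_ref, 4096, 10, &result)` mirrors the 4k buffer and loop count quoted above. With so few iterations the result is likely to be below the resolution of `clock()`, so a larger count is advisable when comparing variants.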

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote: > * Andi Kleen wrote: > > > Neil Horman writes: > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > don't have checksum offload hardware) were spending a significant > > > amount of time computing > > > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Neil Horman
On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote: > > * Neil Horman wrote: > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > don't have checksum offload hardware) were spending a significant amount > > of time computing checksums. We found that by split
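
The subject line describes running the checksum across multiple ALUs. The patch itself is not shown in this listing, so the following is only one way to picture the idea in portable C: keep two independent accumulators so that consecutive additions do not depend on each other, then combine them at the end. This works because ones'-complement addition is commutative and associative. The names `csum_two_chains` and `fold_sum` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a wide ones'-complement sum down to 16 bits (end-around carry). */
static uint16_t fold_sum(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Hypothetical sketch: 16-bit ones'-complement sum (host byte order,
 * not inverted) using two accumulators that never depend on each other
 * inside the main loop.
 */
static uint16_t csum_two_chains(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint64_t a = 0, b = 0;
    size_t i = 0;

    for (; i + 4 <= len; i += 4) {
        uint16_t w0, w1;

        memcpy(&w0, p + i, sizeof(w0));
        memcpy(&w1, p + i + 2, sizeof(w1));
        a += w0;                        /* chain A */
        b += w1;                        /* chain B, independent of A */
    }
    for (; i + 2 <= len; i += 2) {      /* at most one leftover word */
        uint16_t w;

        memcpy(&w, p + i, sizeof(w));
        a += w;
    }
    if (i < len) {                      /* trailing odd byte, zero padded */
        uint16_t w = 0;

        memcpy(&w, p + i, 1);
        a += w;
    }
    return fold_sum(a + b);             /* merge the chains, then fold */
}
```

Whether the second chain actually helps depends on the compiler and the CPU, which is the question the thread spends most of its time measuring.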

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Neil Horman
On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote: > Neil Horman writes: > > > Sébastien Dugué reported to me that devices implementing ipoib (which don't > > have > > checksum offload hardware) were spending a significant amount of time > > computing > > Must be an odd workload, most
