> > I think you need 3 instructions, move a 0, conditionally move a 1
> > then add. I suspect it won't be a win!
Or, with an appropriately unrolled loop, for each word:
zero %eax, cmove a 1 to %al
cmove a 1 to %ah
shift %eax left, cmove a 1 to %al
cmove a 1 to %ah,
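For reference, a minimal standalone sketch of the three-instruction pattern being
discussed (clear a register, conditionally move a 1 on carry, then add); the
function name, operand choices and surrounding C are illustrative only, not the
posted kernel patch:

#include <stdint.h>

/*
 * Illustrative only: fold the carry of one 64-bit add back into the sum
 * using mov/cmovc/add instead of a conditional branch or adc.
 */
static uint64_t csum_add_word_cmov(uint64_t sum, uint64_t word)
{
        uint64_t carry;

        asm("addq  %[word], %[sum]\n\t"   /* may set the carry flag          */
            "movl  $0, %k[carry]\n\t"     /* carry = 0 (mov leaves CF alone) */
            "movl  $1, %%ecx\n\t"
            "cmovc %%rcx, %[carry]\n\t"   /* carry = 1 only if CF was set    */
            "addq  %[carry], %[sum]"      /* fold the carry back in          */
            : [sum] "+r" (sum), [carry] "=&r" (carry)
            : [word] "r" (word)
            : "rcx", "cc");
        return sum;
}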
On Fri, Nov 01, 2013 at 01:26:52PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> > On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > >
> > > > I think it would be better if we just did
On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> >
> > > I think it would be better if we just did the prefetch here
> > > and re-addressed this area when AVX (or addcx/a
On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
>
> > I think it would be better if we just did the prefetch here
> > and re-addressed this area when AVX (or addcx/addox) instructions were
> > available
> > for testing on hardwa
On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> I think it would be better if we just did the prefetch here
> and re-addressed this area when AVX (or addcx/addox) instructions were
> available
> for testing on hardware.
Could there be a difference if only a single software
prefetch was d
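For concreteness, a userspace-style sketch of the kind of software prefetch being
debated here; the 5*64-byte distance mirrors the "5x64" variant that shows up in
the measurements later in the thread, but the right distance (and whether a single
prefetch suffices) is exactly the open question. Kernel code would use the
prefetch() helper rather than the builtin:

#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only (not the kernel's csum_partial): sum 64-bit words
 * while issuing one software prefetch per 64-byte cache line, a fixed
 * distance ahead of the data being folded.  Carry handling is omitted.
 */
static uint64_t fold_with_prefetch(const uint64_t *p, size_t nwords)
{
        uint64_t sum = 0;

        for (size_t i = 0; i < nwords; i++) {
                if ((i & 7) == 0)       /* once per cache line */
                        __builtin_prefetch((const char *)(p + i) + 5 * 64);
                sum += p[i];
        }
        return sum;
}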
On Fri, Nov 01, 2013 at 04:18:50PM -, David Laight wrote:
> > How would you suggest replacing the jumps in this case? I agree it would be
> > faster here, but I'm not sure how I would implement an increment using a
> > single
> > conditional move.
>
> I think you need 3 instructions, move a
> How would you suggest replacing the jumps in this case? I agree it would be
> faster here, but I'm not sure how I would implement an increment using a
> single
> conditional move.
I think you need 3 instructions, move a 0, conditionally move a 1
then add. I suspect it won't be a win!
If you d
On Fri, 2013-11-01 at 12:08 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 03:42:46PM +, Ben Hutchings wrote:
> > On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
> > [...]
> > > It
> > > functions, but unfortunately the performance lost to the completely broken
> > > branch predictio
On Fri, Nov 01, 2013 at 03:42:46PM +, Ben Hutchings wrote:
> On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
> [...]
> > It
> > functions, but unfortunately the performance lost to the completely broken
> > branch prediction that this inflicts makes it a non starter:
> [...]
>
> Conditio
On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
[...]
> It
> functions, but unfortunately the performance lost to the completely broken
> branch prediction that this inflicts makes it a non starter:
[...]
Conditional branches are no good but conditional moves might be worth a shot.
Ben.
--
On Fri, Nov 01, 2013 at 10:13:37AM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > >
> > > * Neil Horman wrote:
> > >
> > > > > etc. For such short runtimes make sure the last column displays
> > > > > close to 100%, s
* Neil Horman wrote:
> Prefetch and simulated adcx/adox from above:
> Performance counter stats for './test.sh' (20 runs):
>
> 35,704,331 L1-dcache-load-misses ( +- 0.07% ) [75.00%]
> 0 L1-dcache-prefetches
* Neil Horman wrote:
> On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman wrote:
> >
> > > > etc. For such short runtimes make sure the last column displays
> > > > close to 100%, so that the PMU results become trustable.
> > > >
> > > > A nehalem+ PMU will a
On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
>
> > That does make sense, but it then begs the question, what's the advantage of
> > having multiple ALUs at all?
>
> There's lots of ALU operations that don't operate on the flags or
> oth
On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > > etc. For such short runtimes make sure the last column displays
> > > close to 100%, so that the PMU results become trustable.
> > >
> > > A nehalem+ PMU will allow 2-4 events to be measured in parall
* Neil Horman wrote:
> > etc. For such short runtimes make sure the last column displays
> > close to 100%, so that the PMU results become trustable.
> >
> > A nehalem+ PMU will allow 2-4 events to be measured in parallel,
> > plus generics like 'cycles', 'instructions' can be added 'for free
...
> and then I also wanted to try using both xmm and ymm registers and doing
> 64bit adds with 32bit numbers across multiple xmm/ymm registers as that
> should parallel nicely. David, you mentioned you've tried this, how did
> your experiment turn out and what was your method? I was planning on
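A hedged sketch of the xmm idea mentioned above: widen 32-bit words into 64-bit
lanes and accumulate with 64-bit vector adds, so no carry flag is involved and the
lanes sum independently. This is my illustration (requires SSE4.1, compile with
-msse4.1), not anything that was posted in the thread:

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Illustrative only; final reduction to a 16-bit checksum is omitted. */
static uint64_t vec_sum32(const uint32_t *p, size_t n)   /* n multiple of 4 */
{
        __m128i acc0 = _mm_setzero_si128();
        __m128i acc1 = _mm_setzero_si128();

        for (size_t i = 0; i < n; i += 4) {
                __m128i v = _mm_loadu_si128((const __m128i *)(p + i));

                /* zero-extend the low and high 32-bit pairs to 64-bit lanes */
                acc0 = _mm_add_epi64(acc0, _mm_cvtepu32_epi64(v));
                acc1 = _mm_add_epi64(acc1,
                                     _mm_cvtepu32_epi64(_mm_srli_si128(v, 8)));
        }
        acc0 = _mm_add_epi64(acc0, acc1);
        return (uint64_t)_mm_extract_epi64(acc0, 0) +
               (uint64_t)_mm_extract_epi64(acc0, 1);
}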
On 10/30/2013 07:02 AM, Neil Horman wrote:
> That does make sense, but it then begs the question, what's the advantage of
> having multiple ALUs at all?
There's lots of ALU operations that don't operate on the flags or other
entities that can be run in parallel.
If they're just going to seria
On 10/30/2013 08:18 AM, David Laight wrote:
> > /me wonders if rearranging the instructions into this order:
> > adcq 0*8(src), res1
> > adcq 1*8(src), res2
> > adcq 2*8(src), res1
> Those have to be sequenced.
> Using a 64bit lea to add 32bit quantities should avoid the
> dependencies on the flags register.
> Howeve
> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1
Those have to be sequenced.
Using a 64bit lea to add 32bit quantities should avoid the
dependencies on the flags register.
However you'd need to get 3 of those active t
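To illustrate the lea point: lea computes an address-style add without reading or
writing EFLAGS, so several such chains carry no flag dependency between them. A
standalone sketch (names and the operand split are mine, not from the thread):

#include <stdint.h>

/*
 * Illustrative only: add two 32-bit quantities into a 64-bit accumulator
 * with lea, which never touches the flags, so independent chains of these
 * can be scheduled in parallel.  Folding the wide sums back down to a
 * 16-bit checksum is left out.
 */
static uint64_t lea_accumulate(uint64_t sum, uint32_t a, uint32_t b)
{
        asm("leaq (%[sum],%[a]), %[sum]\n\t"   /* sum += a, flags untouched */
            "leaq (%[sum],%[b]), %[sum]"       /* sum += b, flags untouched */
            : [sum] "+r" (sum)
            : [a] "r" ((uint64_t)a), [b] "r" ((uint64_t)b));
        return sum;
}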
On Wed, Oct 30, 2013 at 01:25:39AM -0400, Doug Ledford wrote:
> * Neil Horman wrote:
> > 3) The run times are proportionally larger, but still indicate that
> > Parallel ALU
> > execution is hurting rather than helping, which is counter-intuitive. I'm
> > looking into it, but thought you might w
> The parallel ALU design of this patch seems OK at first glance, but it means
> that two parallel operations are both trying to set/clear both the overflow
> and carry flags of the EFLAGS register of the *CPU* (not the ALU). So, either
> some CPU in the past had a set of overflow/carry flags per
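This flag contention is what the adcx/adox extension (mentioned elsewhere in this
thread) is meant to address: adcx carries only through CF and adox only through OF,
so two addition chains can be interleaved without fighting over a single flag. A
hedged sketch, assuming an ADX-capable CPU and assembler; the layout and the final
folding are mine and gloss over the all-ones edge cases:

#include <stdint.h>

/* Illustrative only: two interleaved carry chains over four 64-bit words. */
static uint64_t sum64_adx(const uint64_t *p)
{
        uint64_t a, b;
        const uint64_t zero = 0;

        asm("xorl %k[a], %k[a]\n\t"      /* a = 0; also clears CF and OF  */
            "xorl %k[b], %k[b]\n\t"      /* b = 0                         */
            "adcx %[w0], %[a]\n\t"       /* chain 1: carries only via CF  */
            "adox %[w1], %[b]\n\t"       /* chain 2: carries only via OF  */
            "adcx %[w2], %[a]\n\t"       /* so the chains share no flag   */
            "adox %[w3], %[b]\n\t"
            "adcx %[b],  %[a]\n\t"       /* merge chain 2, consuming CF   */
            "adox %[zero], %[a]\n\t"     /* consume chain 2's pending OF  */
            "adcq $0, %[a]"              /* and the CF from the merge     */
            : [a] "=&r" (a), [b] "=&r" (b)
            : [w0] "r" (p[0]), [w1] "r" (p[1]),
              [w2] "r" (p[2]), [w3] "r" (p[3]),
              [zero] "r" (zero)
            : "cc");
        return a;
}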
* Neil Horman wrote:
> 3) The run times are proportionally larger, but still indicate that Parallel
> ALU
> execution is hurting rather than helping, which is counter-intuitive. I'm
> looking into it, but thought you might want to see these results in case
> something jumped out at you
So here'
On Tue, Oct 29, 2013 at 03:27:16PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > So, I apologize, you were right. I was running the test.sh script
> > but perf was measuring itself. [...]
>
> Ok, cool - one mystery less!
>
> > Which overall looks a lot more like I expect, save for
* Neil Horman wrote:
> So, I apologize, you were right. I was running the test.sh script
> but perf was measuring itself. [...]
Ok, cool - one mystery less!
> Which overall looks a lot more like I expect, save for the parallel
> ALU cases. It seems here that the parallel ALU changes actually
On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > I'm sure it worked properly on my system here, I specifically
> > checked it, but I'll gladly run it again. You have to give me an
> > hour as I have a meeting to run to, but I'll have results shortly
On 10/29/13 6:52 AM, Ingo Molnar wrote:
According to the perf man page, I'm supposed to be able to use --
to separate perf command line parameters from the command I want
to run. And it definitely executed test.sh, I added an echo to
stdout in there as a test run and observed them get captured i
* Neil Horman wrote:
> I'm sure it worked properly on my system here, I specifically
> checked it, but I'll gladly run it again. You have to give me an
> hour as I have a meeting to run to, but I'll have results shortly.
So what I tried to react to was this observation of yours:
> > > Here
On Tue, Oct 29, 2013 at 01:52:33PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > >
> > > * Neil Horman wrote:
> > >
> > > > Sure it was this:
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_
* Neil Horman wrote:
> On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman wrote:
> >
> > > Sure it was this:
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd
On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > Sure it was this:
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging --
> > /root/
* Neil Horman wrote:
> Sure it was this:
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging --
> /root/test.sh
> done >> counters.txt 2>&1
>
> where test.sh is:
> #!/bin/sh
> echo 1
On Tue, Oct 29, 2013 at 09:25:42AM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > Here's my data for running the same test with taskset restricting
> > execution to only cpu0. I'm not quite sure what's going on here,
> > but doing so resulted in a 10x slowdown of the runtime of each
* Doug Ledford wrote:
> [ Snipped a couple of really nice real-life bandwidth tests. ]
> Some of my preliminary results:
>
> 1) Regarding the initial claim that changing the code to have two
> addition chains, allowing the use of two ALUs, doubling
> performance: I'm just not seeing it. I h
* Neil Horman wrote:
> Here's my data for running the same test with taskset restricting
> execution to only cpu0. I'm not quite sure what's going on here,
> but doing so resulted in a 10x slowdown of the runtime of each
> iteration which I can't explain. As before however, both the
> parall
On Mon, Oct 28, 2013 at 01:46:30PM -0400, Neil Horman wrote:
> On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman wrote:
> >
> > > Looking at the specific cpu counters we get this:
> > >
> > > Base:
> > > Total time: 0.179 [sec]
> > >
> > > Performance cou
On Mon, Oct 28, 2013 at 05:20:45PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > Base:
> > 0.093269042 seconds time elapsed ( +- 2.24% )
> > Prefetch (5x64):
> >0.079440009 seconds time elapsed
On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > Looking at the specific cpu counters we get this:
> >
> > Base:
> > Total time: 0.179 [sec]
> >
> > Performance counter stats for 'perf bench sched messaging -- bash -c echo
> > 1 > /sys/module/c
On 10/26/2013 07:55 AM, Ingo Molnar wrote:
>
> * Doug Ledford wrote:
>
>>> What I was objecting to strongly here was to measure the _wrong_
>>> thing, i.e. the cache-hot case. The cache-cold case should be
>>> measured in a low noise fashion, so that results are
>>> representative. It's closer to
On 10/28/13 10:24 AM, Ingo Molnar wrote:
> The most accurate method of measurement for such single-threaded
> workloads is something like:
> taskset 0x1 perf stat -a -C 1 --repeat 20 ...
> this will bind your workload to CPU#0, and will do PMU measurements
> only there - without mixing in other
* Neil Horman wrote:
> Looking at the specific cpu counters we get this:
>
> Base:
> Total time: 0.179 [sec]
>
> Performance counter stats for 'perf bench sched messaging -- bash -c echo 1
> > /sys/module/csum_test/parameters/test_fire' (20 runs):
>
>1571.304618 task-clock
* Neil Horman wrote:
> Base:
> 0.093269042 seconds time elapsed ( +- 2.24% )
> Prefetch (5x64):
> 0.079440009 seconds time elapsed ( +- 2.29% )
> Parallel ALU:
> 0.08777 seconds tim
Ingo, et al.-
Ok, sorry for the delay, here are the test results you've been asking
for.
First, some information about what I did. I attached the module that I ran this
test with at the bottom of this email. You'll note that I started using a
module parameter write patch to trigger th
On Sun, Oct 27, 2013 at 08:26:32AM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > > You keep ignoring my request to calculate and account for noise of the
> > > measurement.
> >
> > Don't confuse "ignoring" with "haven't gotten there yet". [...]
>
> So, instead of replying to my re
* Neil Horman wrote:
> > You keep ignoring my request to calculate and account for noise of the
> > measurement.
>
> Don't confuse "ignoring" with "haven't gotten there yet". [...]
So, instead of replying to my repeated feedback with a single line mail
that you plan to address it, you repea
On Sat, Oct 26, 2013 at 02:01:08PM +0200, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> > >
> > > >
> > > > Ok, so I ran the above code on a single cpu using taskset,
* Neil Horman wrote:
> On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> >
> > >
> > > Ok, so I ran the above code on a single cpu using taskset, and set irq
> > > affinity
> > > such that no interrupts (save for local on
* Doug Ledford wrote:
> > What I was objecting to strongly here was to measure the _wrong_
> > thing, i.e. the cache-hot case. The cache-cold case should be
> > measured in a low noise fashion, so that results are
> > representative. It's closer to the real usecase than any other
> > microbe
On Fri, Oct 18, 2013 at 10:09:54AM -0700, H. Peter Anvin wrote:
> If implemented properly adcx/adox should give additional speedup... that is
> the whole reason for their existence.
>
Ok, fair enough. Unfortunately, I'm not going to be able to get my hands on a
stepping of this CPU to test any co
On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
>
> >
> > Ok, so I ran the above code on a single cpu using taskset, and set irq
> > affinity
> > such that no interrupts (save for local ones), would occur on that cpu.
> > No
On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
>
> Ok, so I ran the above code on a single cpu using taskset, and set irq
> affinity
> such that no interrupts (save for local ones), would occur on that cpu. Note
> that I had to convert csum_partial_opt to csum_partial, as the _opt varian
On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
>
> > #define BUFSIZ_ORDER 4
> > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > static int __init csum_init_module(void)
> > {
> > int i;
> > __wsum sum = 0;
> >
On 10/19/2013 04:23 AM, Ingo Molnar wrote:
>
> * Doug Ledford wrote:
>> All prefetch operations get sent to an access queue in the memory
>> controller where they compete with both other reads and writes for the
>> available memory bandwidth. The optimal prefetch window is not a factor
>> of
On Mon, Oct 21, 2013 at 10:31:38AM -0700, Eric Dumazet wrote:
> On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> > On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> > >
> > > > #define BUFSIZ_ORDER 4
> > > > #define B
On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> >
> > > #define BUFSIZ_ORDER 4
> > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > > static int __init csum_i
* Doug Ledford wrote:
> >> Based on these, prefetching is obviously a a good improvement, but
> >> not as good as parallel execution, and the winner by far is doing
> >> both.
>
> OK, this is where I have to chime in that these tests can *not* be used
> to say anything about prefetch, and no
On Mon, 2013-10-14 at 22:49 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
>> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
>> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
>> > > attached patch brings much better results
>> > >
>> > > lpq83:~
On 2013-10-17, Ingo wrote:
> * Neil Horman wrote:
>
>> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
>> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
>> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>> > >
>> > > > So, early testing results today. I wrote
On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> #define BUFSIZ_ORDER 4
> #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> static int __init csum_init_module(void)
> {
> int i;
> __wsum sum = 0;
> struct timespec start, end;
> u64 time;
> struct page *pag
On Fri, Oct 18, 2013 at 10:20:35AM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> > >
>
> > for(i=0;i<10;i++) {
> > sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> > offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE :
On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> >
> for(i=0;i<10;i++) {
> sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE : 0;
> }
Please replace this by random accesses, and use the mo
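One way the loop above might be reworked along these lines; this is a sketch
against the test module quoted in this thread (it reuses its BUFSIZ and
csum_partial usage), not the code that was actually posted, and it assumes
prandom_u32() is available in the kernel under test:

/* Sketch only: touch a random page of a large working set per iteration,
 * so csum_partial() mostly sees cache-cold data. */
static __wsum csum_random_pages(const void *buf, int iterations)
{
        __wsum sum = 0;
        unsigned long npages = BUFSIZ / PAGE_SIZE;
        int i;

        for (i = 0; i < iterations; i++) {
                unsigned long page = prandom_u32() % npages;

                sum = csum_partial(buf + page * PAGE_SIZE, PAGE_SIZE, sum);
        }
        return sum;
}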
If implemented properly adcx/adox should give additional speedup... that is the
whole reason for their existence.
Neil Horman wrote:
>On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
>> On 10/11/2013 09:51 AM, Neil Horman wrote:
>> > Sébastien Dugué reported to me that devices imp
>
> Your benchmark uses a single 4K page, so data is _super_ hot in cpu
> caches.
> ( prefetch should give no speedups, I am surprised it makes any
> difference)
>
> Try now with 32 huge pages, to get 64 MBytes of working set.
>
> Because in reality we never csum_partial() data in cpu cache.
>
On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
> On 10/11/2013 09:51 AM, Neil Horman wrote:
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't
> > have checksum offload hardware) were spending a significant amount of time
> > computing checksum
* H. Peter Anvin wrote:
> On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> >
> > To correctly simulate the workload you'd have to:
> >
> > - allocate a buffer larger than your L2 cache.
> >
> > - to measure the effects of the prefetches you'd also have to randomize
> >the individual buffer
On Thu, 2013-10-17 at 11:19 -0700, H. Peter Anvin wrote:
> Seriously, though, how much does it matter? All the above seems likely
> to do is to drown the signal by adding noise.
I don't think so.
>
> If the parallel (threaded) checksumming is faster, which theory says it
> should and microbenc
On 10/17/2013 01:41 AM, Ingo Molnar wrote:
>
> To correctly simulate the workload you'd have to:
>
> - allocate a buffer larger than your L2 cache.
>
> - to measure the effects of the prefetches you'd also have to randomize
>the individual buffer positions. See how 'perf bench numa' implem
* Neil Horman wrote:
> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > >
> > > > So, early testing results today. I wrote a test module that, allocated
> >
On Wed, 2013-10-16 at 20:34 -0400, Neil Horman wrote:
> >
>
> So I went to reproduce these results, but was unable to (due to the fact that
> I
> only have a pretty jittery network to do testing across at the moment with
> these devices). So instead I figured that I would go back to just doin
On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> >
> > > So, early testing results today. I wrote a test module that, allocated a
> > > 4k
> > > buffer, initialized it
On Wed, 2013-10-16 at 08:25 +0200, Ingo Molnar wrote:
> Prefetch takes memory from L2->L1 memory
> just as much as it moves cachelines from memory to the L2 cache.
Yup, mea culpa.
I thought the prefetch was still to L1 like the Pentium.
* Joe Perches wrote:
> On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> > * Joe Perches wrote:
> >
> > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
On Tue, 2013-10-15 at 09:21 -0700, Joe Perches wrote:
> Ingo, Eric _showed_ that the prefetch is good here.
> How about looking at a little optimization to the minimal
> prefetch that gives that level of performance.
Wait a minute, my point was to remind that main cost is the
memory fetching.
It
On Tue, 2013-10-15 at 18:02 +0200, Andi Kleen wrote:
> > I get the csum_partial() if disabling prequeue.
>
> At least in the ipoib case i would consider that a misconfiguration.
There is nothing you can do, if application is not blocked on recv(),
but using poll()/epoll()/select(), prequeue is no
On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> * Joe Perches wrote:
>
> > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > attached patch brings much
> I get the csum_partial() if disabling prequeue.
At least in the ipoib case i would consider that a misconfiguration.
"don't do this if it hurts"
There may be more such problems.
-Andi
On Tue, 2013-10-15 at 07:26 -0700, Eric Dumazet wrote:
> And the receiver should also do the same : (ethtool -K eth0 rx off)
>
> 10.55%  netserver  [kernel.kallsyms]  [k] csum_partial_copy_generic
I get the csum_partial() if disabling prequeue.
echo 1 >/proc/sys/net/ipv4/tcp
On Tue, 2013-10-15 at 16:15 +0200, Sébastien Dugué wrote:
> Hi Eric,
>
> On Tue, 15 Oct 2013 07:06:25 -0700
> Eric Dumazet wrote:
> > But the csum cost is both for sender and receiver ?
>
> No, it was only on the receiver side that I noticed it.
>
Yes, as Andi said, we do the csum while cop
Hi Eric,
On Tue, 15 Oct 2013 07:06:25 -0700
Eric Dumazet wrote:
> On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> > On Tue, 15 Oct 2013 15:33:36 +0200
> > Andi Kleen wrote:
> >
> > > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR
> > > > hardware
> > > > wher
On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> On Tue, 15 Oct 2013 15:33:36 +0200
> Andi Kleen wrote:
>
> > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR
> > > hardware
> > > where one cannot benefit from hardware offloads.
> >
> > Is this with sendfile?
>
>
On Tue, 15 Oct 2013 15:33:36 +0200
Andi Kleen wrote:
> > indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > where one cannot benefit from hardware offloads.
>
> Is this with sendfile?
Tests were done with iperf at the time without any extra funky options, and
look
> indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> where one cannot benefit from hardware offloads.
Is this with sendfile?
For normal send() the checksum is done in the user copy and for receiving it
can be also done during the copy in most cases
-Andi
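To illustrate why the checksum is nearly free when it rides along with the copy,
here is a plain-C sketch of a fused copy-and-sum loop (the kernel's real routine is
the csum_partial_copy_generic assembly seen in the profile above; the names, tail
handling and folding here are simplified and mine):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: copy and accumulate in one pass over the data. */
static uint32_t copy_and_sum(void *dst, const void *src, size_t len)
{
        const uint8_t *s = src;
        uint8_t *d = dst;
        uint64_t sum = 0;

        while (len >= sizeof(uint32_t)) {
                uint32_t v;

                memcpy(&v, s, sizeof(v));   /* load once...                */
                memcpy(d, &v, sizeof(v));   /* ...store it as the copy     */
                sum += v;                   /* ...and fold it into the sum */
                s += sizeof(v);
                d += sizeof(v);
                len -= sizeof(v);
        }
        while (len--)                       /* tail bytes: copied but not  */
                *d++ = *s++;                /* summed in this sketch       */

        while (sum >> 32)                   /* fold the wide sum to 32 bits */
                sum = (sum & 0xffffffff) + (sum >> 32);
        return (uint32_t)sum;
}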
On Mon, Oct 14, 2013 at 02:07:48PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> > * Andi Kleen wrote:
> >
> > > Neil Horman writes:
> > >
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > > don't have checksum offload ha
On Tue, Oct 15, 2013 at 09:32:48AM +0200, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> > >
> > > * Neil Horman wrote:
> > >
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > > don't have che
* Borislav Petkov wrote:
> On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> > Most processors have hundreds of cachelines even in their L1 cache.
> > Thousands in the L2 cache, up to hundreds of thousands.
>
> Also, I have this hazy memory of prefetch hints being harmful in some
>
On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> Most processors have hundreds of cachelines even in their L1 cache.
> Thousands in the L2 cache, up to hundreds of thousands.
Also, I have this hazy memory of prefetch hints being harmful in some
situations: https://lwn.net/Articles/44
Hi Neil, Andi,
On Mon, 14 Oct 2013 16:25:28 -0400
Neil Horman wrote:
> On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> > Neil Horman writes:
> >
> > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > don't have checksum offload hardware) were s
* Joe Perches wrote:
> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > attached patch brings much better results
> > > >
> > > > lpq83:~# ./netperf -H 7.7.8.84 -
* Neil Horman wrote:
> On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> >
> > * Neil Horman wrote:
> >
> > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > don't have checksum offload hardware) were spending a significant amount
> > > of time comput
On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > attached patch brings much better results
> > >
> > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > MIGRATED TCP STREAM T
On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > attached patch brings much better results
> >
> > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84
> > () port
On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> attached patch brings much better results
>
> lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
> port 0 AF_INET
> Recv SendSend Utilizati
On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>
> > So, early testing results today. I wrote a test module that, allocated a 4k
> > buffer, initialized it with random data, and called csum_partial on it 10
> > times, recording th
On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> So, early testing results today. I wrote a test module that, allocated a 4k
> buffer, initialized it with random data, and called csum_partial on it 10
> times, recording the time at the start and end of that loop. Results on a 2.4
> GHz
On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> * Andi Kleen wrote:
>
> > Neil Horman writes:
> >
> > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > don't have checksum offload hardware) were spending a significant
> > > amount of time computing
> >
> >
On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > Sébastien Dugué reported to me that devices implementing ipoib (which
> > don't have checksum offload hardware) were spending a significant amount
> > of time computing checksums. We found that by split
On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> Neil Horman writes:
>
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't
> > have checksum offload hardware) were spending a significant amount of time
> > computing
>
> Must be an odd workload, most