On Fri, Jul 21, 2017 at 03:26:21PM +0200, Peter Zijlstra wrote:
>
> EVENT=0 -DNEW=1 -DFLS=1
> event: 19.626050 +- 0.038995
> EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
> event: 109.610670 +- 0.425667
>
> EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1
> event: 21.445680 +- 0.043782
> EVENT=0 -DNEW=1 -DFLS=1 -DANSHU
On Fri, Jul 21, 2017 at 05:15:10AM -0700, Joe Perches wrote:
> On Fri, 2017-07-21 at 13:40 +0200, Peter Zijlstra wrote:
> > @@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
> > if (x <= 1)
> > return x;
> >
> > - m = 1UL << (BITS_PER_LONG - 2);
> > + m = 1UL << (__fl
On Fri, 2017-07-21 at 13:40 +0200, Peter Zijlstra wrote:
> @@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
> if (x <= 1)
> return x;
>
> - m = 1UL << (BITS_PER_LONG - 2);
> + m = 1UL << (__fls(x) & ~1U);
> +
> + while (m > x)
> + m >>= 2;
wh
On Thu, Jul 20, 2017 at 04:24:32PM -0700, Linus Torvalds wrote:
> So mayube something like the attached?
>
> NOTE NOTE NOTE! This may be pure garbage. It's untested. But the code
> generation looks ok even if gcc is being way too clever and turns that
> first loop into a counted loop instead. Wha
On Thu, Jul 20, 2017 at 3:34 PM, Peter Zijlstra wrote:
>>
>> "Why does this matter, and what is the range it matters for?"
>
> I was looking to do some work on the idle estimator. Parts of that keep
> online avg and variance for normal distributions. I wanted to bias the
> avg downwards, the way
On Thu, Jul 20, 2017 at 11:31:36AM -0700, Linus Torvalds wrote:
> How did this two-year old thread get resurrected?
I was looking for the original thread doing that 'optimization'
Davidlohr did but found this first.
> And the *most* important question is that first one:
>
> "Why does this matte
How did this two-year old thread get resurrected?
Anyway, it got resurrected without even answering one core question:
On Thu, Jul 20, 2017 at 4:24 AM, Peter Zijlstra wrote:
> On Mon, Feb 02, 2015 at 11:13:44AM -0800, Linus Torvalds wrote:
On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds
>>
On Thu, Jul 20, 2017 at 01:24:49PM +0200, Peter Zijlstra wrote:
> ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=1000 -DNEW=1 -DFLS=1 -DANSHUL=1
> ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt
>
> Performance counter stats for './sqrt' (10 runs):
>
>328,415,775 cycle
On Thu, Jul 20, 2017 at 04:52:49AM -0700, Joe Perches wrote:
> On Thu, 2017-07-20 at 13:24 +0200, Peter Zijlstra wrote:
> []
> > + m = 1UL << ((fls(x) + 1) & ~1UL);
>
> maybe
>
> #if BITS_PER_LONG == 64
> m = 1UL << ((fls64(x) + 1) & ~1UL);
> #else
> m = 1UL << ((fls(x) + 1) & ~1UL
On Thu, 2017-07-20 at 13:24 +0200, Peter Zijlstra wrote:
[]
> + m = 1UL << ((fls(x) + 1) & ~1UL);
maybe
#if BITS_PER_LONG == 64
m = 1UL << ((fls64(x) + 1) & ~1UL);
#else
m = 1UL << ((fls(x) + 1) & ~1UL);
#endif
On Mon, Feb 02, 2015 at 11:13:44AM -0800, Linus Torvalds wrote:
> On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds
> wrote:
> >
> > (I'm also not entirely sure what uses int_sqrt() that ends up being so
> > performance-critical, so it would be good to document that too, since
> > that probably also
Dear Mr. linus,
Thanks for quick replies.
Yes performance numbers are not conclusive enough.
So its better to discard this patch as of now.
I will try to explore more in this area.
Thanks & regards
Anshul Garg
On Fri, Feb 6, 2015 at 1:07 AM, Linus Torvalds
wrote:
> On Thu, Feb 5, 2015 at 1
On Thu, Feb 5, 2015 at 10:43 AM, Anshul Garg wrote:
>
> NOTE ::
> I have not used gcc optimizations while compilation.
> With O2 level optimization proposed solution is taking more time.
The thing is, the kernel is compiled with -O2, so that's what matters.
Also, for very tight loops like this,
On Thu, Feb 5, 2015 at 10:33 AM, Linus Torvalds
wrote:
> On Thu, Feb 5, 2015 at 10:20 AM, Linus Torvalds
> wrote:
>>
>> Hmm. I did that too [..]
>
> Side note: one difference in our results (apart from possibly just CPU
> uarch details) is that my loop goes to 100M to make it easier to just
> tim
On Thu, Feb 5, 2015 at 10:20 AM, Linus Torvalds
wrote:
>
> Hmm. I did that too [..]
Side note: one difference in our results (apart from possibly just CPU
uarch details) is that my loop goes to 100M to make it easier to just
time it. Which means that my load essentially had about three more
itera
On Tue, Feb 3, 2015 at 7:42 AM, Anshul Garg wrote:
>
> I have done profiling of int_sqrt function using perf tool for 10 times.
> For this purpose i have created a userspace program which uses sqrt function
> from 1 to a million.
Hmm. I did that too, and it doesn't improve things for me. In fact,
On Tue, Feb 3, 2015 at 10:11 AM, Davidlohr Bueso wrote:
> On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote:
>> IOW, instead of
>>
>> m = 1UL << (BITS_PER_LONG - 2);
>>
>> perhaps something like
>>
>>m = 1UL << (BITS_PER_LONG/2- 2);
>>if (m < x)
>>m <<= BITS_PER_LONG/2;
On
On Tue, Feb 3, 2015 at 10:17 AM, Davidlohr Bueso wrote:
> On Mon, 2015-02-02 at 11:00 -0800, Linus Torvalds wrote:
>> Hmm. I don't disagree, but would like some more feedback.
>>
>> Davidlohr - you were the person to touch this function last (commit
>> 30493cc9dddb: "lib/int_sqrt.c: optimize squar
On Mon, 2015-02-02 at 20:41 -0800, Davidlohr Bueso wrote:
> On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote:
> > IOW, instead of
> >
> > m = 1UL << (BITS_PER_LONG - 2);
> >
> > perhaps something like
> >
> >m = 1UL << (BITS_PER_LONG/2- 2);
> >if (m < x)
> >m <<= BITS_PE
On Mon, 2015-02-02 at 11:00 -0800, Linus Torvalds wrote:
> Hmm. I don't disagree, but would like some more feedback.
>
> Davidlohr - you were the person to touch this function last (commit
> 30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and
> you did so for performance reasons.
On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote:
> IOW, instead of
>
> m = 1UL << (BITS_PER_LONG - 2);
>
> perhaps something like
>
>m = 1UL << (BITS_PER_LONG/2- 2);
>if (m < x)
>m <<= BITS_PER_LONG/2;
>
> (assuming gcc can change that code into a "cmov") might cut dow
On Mon, Feb 2, 2015 at 11:10 AM, Joe Perches wrote:
>
> Perhaps removing the while and using fls(x) would be better.
No. fls is often slow, and you then do have to also round it down to
an even bit number etc, so it gets somewhat complex too.
We *probably* have some argument range that we care m
On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds
wrote:
>
> (I'm also not entirely sure what uses int_sqrt() that ends up being so
> performance-critical, so it would be good to document that too, since
> that probably also matters for the "what's the normal argument range"
> question..)
... it's
On Mon, 2015-02-02 at 09:12 -0800, Anshul Garg wrote:
> From: Anshul Garg
>
> Unnecessary instructions are executing even though m is
> greater than x so added logic to make m less than equal to
> x before performing these operations.
>
> Signed-off-by: Anshul Garg
> ---
> lib/int_sqrt.c |
Hmm. I don't disagree, but would like some more feedback.
Davidlohr - you were the person to touch this function last (commit
30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and
you did so for performance reasons. And in fact, when you did that,
you removed that initial loop:
-
From: Anshul Garg
Unnecessary instructions are executing even though m is
greater than x so added logic to make m less than equal to
x before performing these operations.
Signed-off-by: Anshul Garg
---
lib/int_sqrt.c |3 +++
1 file changed, 3 insertions(+)
diff --git a/lib/int_sqrt.c b/li
From: Anshul Garg
Unnecessary instructions are executing even though m is
greater than x so added logic to make m less than equal to
x before performing these operations.
Signed-off-by: Anshul Garg
---
lib/int_sqrt.c |3 +++
1 file changed, 3 insertions(+)
diff --git a/lib/int_sqrt.c b/li
From: Anshul Garg
Unnecessary instructions are executing even though m is
greater than x so added logic to make m less than equal to
x before performing these operations.
Signed-off-by: Anshul Garg
---
lib/int_sqrt.c |3 +++
1 file changed, 3 insertions(+)
diff --git a/lib/int_sqrt.c b/li
28 matches
Mail list logo