Jeremy Kerr writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.
Per subsequent discussion and testing, I've committed a patch that
uses a lookup table instead of depending on __builtin_clz().
http://arch
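For reference, the committed lookup-table approach can be sketched roughly like this. This is a simplified, self-contained version, not the committed aset.c code: the table here is filled at runtime for brevity (the real one is a static constant), and `free_index`/`init_log_table` are illustrative names.

```c
#include <assert.h>

#define ALLOC_MINBITS 3         /* smallest chunk is 1 << 3 = 8 bytes */

/* LogTable256[i] = number of bits needed to represent i (0 for i == 0) */
static unsigned char LogTable256[256];

static void
init_log_table(void)
{
    int i;

    for (i = 1; i < 256; i++)
        LogTable256[i] = LogTable256[i / 2] + 1;
}

/* Sketch of a lookup-table AllocSetFreeIndex: chunk index is the bit
 * length of (size - 1) >> ALLOC_MINBITS, found a byte at a time. */
static int
free_index(unsigned long size)
{
    int idx = 0;

    if (size > (1 << ALLOC_MINBITS))
    {
        unsigned long t = (size - 1) >> ALLOC_MINBITS;

        /* peel off whole bytes, then look up the last one */
        while (t > 0xff)
        {
            idx += 8;
            t >>= 8;
        }
        idx += LogTable256[t];
    }
    return idx;
}
```

This replaces the data-dependent shift loop with at most a few table probes, which is the speed/portability tradeoff the thread converged on.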
p...@thetdh.com writes:
> Normally I'd try a small lookup table (1-byte index to 1-byte value)
> in this case. But if the bitscan instruction were even close in
> performance, it'd be preferable, due to its more-reliable caching
> behavior;
Well, the problem with the bitscan instruction is we
Normally I'd try a small lookup table (1-byte index to 1-byte value) in this
case. But if the bitscan instruction were even close in performance, it'd be
preferable, due to its more-reliable caching behavior; it should be possible to
capture this at code-configuration time (aligned so as to pro
Jeremy Kerr writes:
> Thanks for the benchmark app, thought I'd pitch in with some ppc
> results:
It looks to me like we should go with the lookup table approach, as
being the best tradeoff of speed improvement vs platform and compiler
independence. The "float hack" is right out ;-)
I wonder w
Hi Greg,
Thanks for the benchmark app, thought I'd pitch in with some ppc
results:
> clz 1.530s
> lookup table 1.720s
> float hack 4.424s
> unrolled 5.280s
> normal 5.369s
POWER5+:
clz 2.046s
lookup table 2.18
Greg Stark writes:
> Well it was a bit of a pain but I filled an array with (1/1000 scaled
> down) values and then permuted them. I also went ahead and set the
> low-order bits to random values since the lookup table based algorithm
> might be affected by it.
> The results are a bit disappointing
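The input-preparation step described above might look like the following sketch. The names (`fill_and_shuffle`, `NVALUES`) are illustrative, not from the original harness: it fills an array with a spread of sizes, randomizes the low-order bits that a lookup-table implementation actually inspects, then permutes the array so consecutive calls don't hit the same value.

```c
#include <stdlib.h>

#define NVALUES 1000

/* Hypothetical benchmark-input setup: spread of sizes, randomized
 * low bits, Fisher-Yates shuffle to defeat branch prediction. */
static void
fill_and_shuffle(unsigned long *values, int n)
{
    int i;

    for (i = 0; i < n; i++)
    {
        unsigned long size = (unsigned long)(i + 1);

        /* keep the chunk index intact, randomize only the low 3 bits */
        values[i] = (size << 3) | (rand() & 7);
    }

    /* Fisher-Yates shuffle so the call sequence isn't predictable */
    for (i = n - 1; i > 0; i--)
    {
        int j = rand() % (i + 1);
        unsigned long tmp = values[i];

        values[i] = values[j];
        values[j] = tmp;
    }
}
```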
On Tue, Jul 21, 2009 at 12:07 AM, Tom Lane wrote:
> Greg Stark writes:
>> I also wonder if this microbenchmark is actually ok because it's
>> testing the same value over and over so any branch-prediction will
>> shine unrealistically well.
>
> Yeah, that is a good point --- and it would benefit th
Greg Stark writes:
> I also wonder if this microbenchmark is actually ok because it's
> testing the same value over and over so any branch-prediction will
> shine unrealistically well.
Yeah, that is a good point --- and it would benefit the unrolled loop
more than the other versions. We should p
Doh, I forgot this bit. Will rerun tests now.
On Mon, Jul 20, 2009 at 8:25 PM, Tom Lane wrote:
> (Note: I found
> out that it's necessary to split the test into two files --- otherwise
> gcc will inline AllocSetFreeIndex and partially const-fold the work,
> leading to skewed results.)
--
gre
On Mon, Jul 20, 2009 at 8:37 PM, Tom Lane wrote:
> Anyone want to see if they can beat that? Some testing on other
> architectures would help too.
Hm, I took the three implementations so far (normal, unrolled, and
clz) as well as the two from
http://graphics.stanford.edu/~seander/bithacks.html#In
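For readers who haven't seen it, the "float hack" from that page computes floor(log2(v)) by converting to float and reading the IEEE-754 exponent field. A minimal sketch (exact only while v fits in the float mantissa, i.e. v < 2^24, which covers allocator chunk sizes):

```c
/* "Float hack" from the bithacks page: floor(log2(v)) via the
 * IEEE-754 exponent.  Valid for 0 < v < 2^24. */
static int
ilog2_float(unsigned int v)
{
    union { float f; unsigned int u; } conv;

    conv.f = (float) v;
    return (int) (conv.u >> 23) - 127;  /* biased exponent minus bias */
}
```

As the benchmarks in this thread show, the float round-trip makes it much slower than either clz or the lookup table on the tested hardware.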
I wrote:
> I'm still interested in the idea of doing a manual unroll instead of
> relying on a compiler-specific feature. However, some quick testing
> didn't find an unrolling that helps much.
Hmm, actually this seems to work ok:
idx++;
size >>= 1;
if (size != 0)
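Spelled out fully, that manual unroll might look like the sketch below, with the shift-and-test step repeated a fixed number of times instead of looping. This is an illustration of the idea, not the exact code from the message; eight steps cover sizes up to 2048, and the real repeat count would be bounded by the number of freelists.

```c
/* Sketch of a hand-unrolled AllocSetFreeIndex: the loop
 *     do { idx++; size >>= 1; } while (size != 0);
 * replaced by a fixed cascade.  Valid for size <= 2048 here. */
static int
free_index_unrolled(unsigned long size)
{
    int idx = 0;

    if (size > 8)               /* 1 << ALLOC_MINBITS, ALLOC_MINBITS = 3 */
    {
        size = (size - 1) >> 3;
        idx++; size >>= 1;
        if (size != 0) { idx++; size >>= 1;
        if (size != 0) { idx++; size >>= 1;
        if (size != 0) { idx++; size >>= 1;
        if (size != 0) { idx++; size >>= 1;
        if (size != 0) { idx++; size >>= 1;
        if (size != 0) { idx++; size >>= 1;
        if (size != 0) { idx++; size >>= 1; } } } } } } }
    }
    return idx;
}
```

The appeal is that it needs no compiler-specific builtin, at the cost of a chain of branches whose cost depends on branch prediction.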
Stefan Kaltenbrunner writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort into
>> the BSR instruction. It's constant time, all right, but on most of
>> their CPUs that constant time is like 8 or 16 times slower than an ADD;
>> cf http://www.intel.com/Ass
Tom Lane wrote:
> Jeremy Kerr writes:
>> Rather than testing single bits in a loop, change AllocSetFreeIndex to
>> use the __builtin_clz() function to calculate the chunk index.
>> This requires a new check for __builtin_clz in the configure script.
>> Results in a ~2% performance increase on sysbench
Jeremy Kerr writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.
> This requires a new check for __builtin_clz in the configure script.
> Results in a ~2% performance increase on sysbench on PowerPC.
I
Rather than testing single bits in a loop, change AllocSetFreeIndex to
use the __builtin_clz() function to calculate the chunk index.
This requires a new check for __builtin_clz in the configure script.
Results in a ~2% performance increase on sysbench on PowerPC.
Signed-off-by: Jeremy Kerr
--