On Wed, 3 Mar 2021 at 23:32, John Naylor <john.nay...@enterprisedb.com> wrote:

> Your performance numbers look like this is a fruitful area to improve. I
> have not yet tested performance, but I will do so at a later date.
Thanks for reviewing the patch!

> I did some microbenchmarking of our popcount implementation, since I
> wasn't quite sure it's optimal, and indeed, there is room for improvement
> there [1]. I'd be curious to hear your thoughts on those findings. I
> think it'd be worth it to test a version of this patch using those idioms
> here as well, so at some point I plan to post something.

I am not yet clear about the implications of that work on this patch set,
but I am still going over it, and will reply to that.

> Now for the patches:
>
> 0001:
>
> + /*
> +  * We can process 64-bit chunks only if both are mis-aligned by the same
> +  * number of bytes.
> +  */
> + if (b_aligned - b == a_aligned - a)
>
> The obvious question here is: how often are they identically misaligned?
> You don't indicate that your measurements differ in a bimodal fashion, so
> does that mean they happen to be always (mis)aligned?

I ran CREATE INDEX on tsvector columns using the tsearch.data that I had
attached upthread, with some instrumentation; here are the proportions:

1. In 15% of the cases, only one among a and b was aligned. The other was
   offset from the 8-byte boundary by 4 bytes.
2. In 6% of the cases, both were offset by 4 bytes, i.e. identically
   misaligned.
3. In the rest of the cases, both were aligned.

With other types, and with different sets of data, I believe the
proportions could be totally different.

> Is that an accident of the gist coding and could change unexpectedly? And
> how hard would it be to allocate those values upstream so that the
> pointers are always aligned on 8-byte boundaries? (I imagine pretty hard,
> since there are multiple callers, and such tight coupling is not good
> style)

I am not sure about that; I haven't clearly understood the structure of
gist indexes yet. I believe it would depend on the individual gist
implementation, so we can't assume anything about it.
> + /* For smaller lengths, do simple byte-by-byte traversal */
> + if (bytes <= 32)
>
> You noted upthread:
>
> > Since for the gist index creation of some of these types the default
> > value for siglen is small (8-20), I tested with small siglens. For
> > siglens <= 20, particularly for values that are not multiples of 8
> > (e.g. 10, 13, etc), I see a 1-7 % reduction in speed of index
> > creation. It's probably because of an extra function call for
> > pg_xorcount(); and also might be due to the extra logic in
> > pg_xorcount() which becomes prominent for shorter traversals. So for
> > siglen less than 32, I kept the existing method using byte-by-byte
> > traversal.
>
> I wonder if that can be addressed directly, while cleaning up the loops
> and alignment checks in pg_xorcount_long() a little. For example, have a
> look at pg_crc32c_armv8.c -- it might be worth testing a similar approach.

Yeah, we could put the bytes <= 32 condition inside pg_xorcount_long(),
but I avoided that so as not to hamper the <= 32 scenarios. Details are in
my reply below about why the inline pg_xorcount() calls a global function.

> Also, pardon my ignorance, but where can I find the default siglen for
> various types?

Check SIGLEN_DEFAULT.

> +/* Count the number of 1-bits in the result of xor operation */
> +extern uint64 pg_xorcount_long(const unsigned char *a, const unsigned char *b,
> +                               int bytes);
> +static inline uint64 pg_xorcount(const unsigned char *a, const unsigned char *b,
> +                                 int bytes)
>
> I don't think it makes sense to have a static inline function call a
> global function.

As you may note, the global function is called only in the subset of cases
where siglen is greater than 32. If we instead put the bytes <= 32
condition inside pg_xorcount_long(), it would add a function call for
these small-siglen scenarios.
The idea was: use the function call only for cases where we know that the
function call overhead will be masked by the popcount() optimization.

> -static int
> +static inline int
> hemdistsign(BITVECP a, BITVECP b, int siglen)
>
> Not sure what the reason is for inlining all these callers. Come to think
> of it, I don't see a reason to have hemdistsign() at all anymore. All it
> does is call pg_xorcount(). I suspect that's what Andrey Borodin meant
> when he said upthread:
>
> > > > Meanwhile there are at least 4 incarnations of hemdistsign()
> > > > functions that are quite similar. I'd propose to refactor them
> > > > somehow...

I had something in mind when I finally decided not to remove
hemdistsign(). Now I think you are right: we can remove hemdistsign()
altogether. Let me check again.

--------------------

> 0002:
>
> I'm not really happy with this patch. I like the idea of keeping indirect
> calls out of non-x86 platforms, but I believe it could be done more
> simply.

I am open to other approaches that would make this patch simpler.

> For one, I don't see a need to invent a third category of retail
> function.

Currently we have pg_popcount64_choose, pg_popcount64_slow and
pg_popcount64_asm. With the patch, we have those three, plus
pg_popcount64_nonasm.

I had earlier considered #defining pg_popcount64 as pg_popcount64_slow in
the .h (inside USE_POPCNT_ASM, of course) and leaving pg_popcount64_slow()
untouched. But this would still involve an extra level of function call
for each call to pg_popcount64(), since pg_popcount64_slow() needs to be
an exported function used in multiple places outside pg_bitutils.c; and
our purpose is to avoid indirect calls for this function precisely because
it is called so repeatedly.

So then I thought: why not move the current pg_popcount64_slow()
definition to pg_bitutils.h and make it inline? We can do it that way;
that function would then look similar to how other existing functions like
pg_leftmost_one_pos64() are defined.
But I didn't like it, for two reasons:

1) I found the function name confusing anyway. The only slow part of that
   function is the last part, where it does byte-by-byte traversal. That's
   why I renamed pg_popcount64_slow() to pg_popcount64_nonasm() and kept
   the slow logic in pg_popcount64_slow(). It's ok to keep the slow part
   in a non-inline function, because that part is slow anyway and is
   fallback code for non-supporting platforms.

2) This also keeps the inline pg_popcount64_nonasm() code smaller.

The way I look at the final functions is: pg_popcount64_choose() chooses
between an asm and a non-asm function. pg_popcount64_asm() is the one that
runs direct assembly code. pg_popcount64_nonasm() is used on platforms
where we don't want to call assembly code; it either calls hardware
intrinsics, or falls back to the slow version if the intrinsics are not
available.

If I look at these functions this way, it sounds simpler to me, but I
understand if it may still sound confusing; that's why I mentioned that I
am open to simplifying the patch. Also, the current popcount
platform-specific stuff is already confusing, but I feel what the patch is
trying to do is worth it, because I am getting an extra 7-8% improvement
on my ARM machine.

> Second, there's no reason to put "nonasm" in the header as a static
> inline, and then call from there to the new now-global function "slow".

I explained above why I shifted the nonasm code into the header and made
it inline.

> Especially since the supposed static inline is still needed as a possible
> value as a function pointer on x86, so the compiler is not going to
> inline it on x86 anyway. That just confuses things.

Yeah, the inline is just a request to the compiler anyway, right? On x86,
pg_bitutils.c will have it as a non-inline function, and all the other
files would have it as an inline function which will never be used.
Like I mentioned, it is important to define it as inline in order to avoid
a function call when one calls pg_popcount64(); pg_popcount64() should be
translated directly into the built-in intrinsic.

> (I did make sure to remove indirect calls from the retail functions in
> [1], in case we want to go that route).

Yeah, I quickly had a look at that. I am still going over that thread.
Thanks for the exhaustive analysis there.