On Wed, 3 Mar 2021 at 23:32, John Naylor <john.nay...@enterprisedb.com> wrote:

> Your performance numbers look like this is a fruitful area to improve. I
> have not yet tested performance, but I will do so at a later date.
Thanks for reviewing the patch!

> I did some microbenchmarking of our popcount implementation, since I
> wasn't quite sure it's optimal, and indeed, there is room for improvement
> there [1]. I'd be curious to hear your thoughts on those findings. I
> think it'd be worth it to test a version of this patch using those idioms
> here as well, so at some point I plan to post something.

I am not yet clear about the implications of that work on this patch set,
but I am still going over it, and will reply to that.

> Now for the patches:
>
> 0001:
>
> + /*
> +  * We can process 64-bit chunks only if both are mis-aligned by the same
> +  * number of bytes.
> +  */
> + if (b_aligned - b == a_aligned - a)
>
> The obvious question here is: how often are they identically misaligned?
> You don't indicate that your measurements differ in a bimodal fashion, so
> does that mean they happen to be always (mis)aligned?

I ran CREATE INDEX on tsvector columns using the tsearch.data that I had
attached upthread, with some instrumentation; here are the proportions:

1. In 15% of the cases, only one among a and b was aligned. The other was
   offset from the 8-byte boundary by 4 bytes.
2. In 6% of the cases, both were offset by 4 bytes, i.e. identically
   misaligned.
3. In the rest of the cases, both were aligned.

With other types, and with different sets of data, I believe the
proportions could be totally different.

> Is that an accident of the gist coding and could change unexpectedly? And
> how hard would it be to allocate those values upstream so that the
> pointers are always aligned on 8-byte boundaries? (I imagine pretty hard,
> since there are multiple callers, and such tight coupling is not good
> style)

I am not sure about that; I haven't clearly understood the structure of
gist indexes yet. I believe it would depend on the individual gist
implementation, so we can't assume anything about it.
> + /* For smaller lengths, do simple byte-by-byte traversal */
> + if (bytes <= 32)
>
> You noted upthread:
>
> > Since for the gist index creation of some of these types the default
> > value for siglen is small (8-20), I tested with small siglens. For
> > siglens <= 20, particularly for values that are not multiples of 8
> > (e.g. 10, 13, etc), I see a 1-7 % reduction in speed of index
> > creation. It's probably because of an extra function call for
> > pg_xorcount(); and also might be due to the extra logic in
> > pg_xorcount() which becomes prominent for shorter traversals. So for
> > siglen less than 32, I kept the existing method using byte-by-byte
> > traversal.
>
> I wonder if that can be addressed directly, while cleaning up the loops
> and alignment checks in pg_xorcount_long() a little. For example, have a
> look at pg_crc32c_armv8.c -- it might be worth testing a similar approach.

Yeah, we could put the bytes <= 32 condition inside pg_xorcount_long(),
but I avoided that so as not to hamper the <= 32 scenarios. Details are in
my reply below about why the inline pg_xorcount() calls a global function.

> Also, pardon my ignorance, but where can I find the default siglen for
> various types?

Check SIGLEN_DEFAULT.

> +/* Count the number of 1-bits in the result of xor operation */
> +extern uint64 pg_xorcount_long(const unsigned char *a, const unsigned char *b,
> +                               int bytes);
> +static inline uint64 pg_xorcount(const unsigned char *a, const unsigned char *b,
> +                                 int bytes)
>
> I don't think it makes sense to have a static inline function call a
> global function.

As you may note, the global function is called only in the subset of cases
where siglen is greater than 32. If we instead put the bytes <= 32
condition inside pg_xorcount_long(), it would add a function call for
these small-siglen scenarios.
The idea was: use the function call only for cases where we know that the
function call overhead will be masked by the popcount() optimization.

> -static int
> +static inline int
> hemdistsign(BITVECP a, BITVECP b, int siglen)
>
> Not sure what the reason is for inlining all these callers. Come to think
> of it, I don't see a reason to have hemdistsign() at all anymore. All it
> does is call pg_xorcount(). I suspect that's what Andrey Borodin meant
> when he said upthread:
>
> > > > Meanwhile there are at least 4 incarnations of hemdistsign()
> > > > functions that are quite similar. I'd propose to refactor them
> > > > somehow...

I had something in mind when I finally decided not to remove
hemdistsign(). Now I think you are right: we can remove hemdistsign()
altogether. Let me check again.

--------------------

> 0002:
>
> I'm not really happy with this patch. I like the idea of keeping indirect
> calls out of non-x86 platforms, but I believe it could be done more
> simply.

I am open to other approaches that would make this patch simpler.

> For one, I don't see a need to invent a third category of retail
> function.

Currently we have pg_popcount64_choose, pg_popcount64_slow and
pg_popcount64_asm. With the patch, we have those three, plus
pg_popcount64_nonasm.

I had earlier considered #defining pg_popcount64 as pg_popcount64_slow in
the .h (inside USE_POPCNT_ASM, of course) and leaving pg_popcount64_slow()
untouched. But this would still involve an extra level of function call
for each call to pg_popcount64(), since pg_popcount64_slow() needs to be
an exported function used in multiple places outside pg_bitutils.c; and
our purpose is to avoid indirect calls for this function precisely because
it is called so repeatedly.

So then I thought: why not move the current pg_popcount64_slow()
definition to pg_bitutils.h and make it inline? We can do it that way;
that function would then look similar to how other existing functions like
pg_leftmost_one_pos64() are defined.
But I didn't like it, for two reasons:

1) I found the function name confusing anyway. The only slow part of that
   function is the last part, where it does byte-by-byte traversal. That's
   why I renamed pg_popcount64_slow() to pg_popcount64_nonasm() and kept
   the slow logic in pg_popcount64_slow(). It's ok to keep the slow part
   in a non-inline function, because that part is slow anyway and is
   fallback code for non-supporting platforms.

2) This also keeps the inline pg_popcount64_nonasm() code smaller.

The way I look at the final functions is: pg_popcount64_choose() chooses
between an asm and a non-asm function. pg_popcount64_asm() is the one that
runs direct assembly code. pg_popcount64_nonasm() is used on platforms
where we don't want to call assembly code; it either calls hardware
intrinsics, or falls back to the slow version if the intrinsics are not
available.

If I look at these functions this way, it sounds simpler to me, but I
understand if it may still sound confusing; that's why I mentioned that I
am open to simplifying the patch. Also, the current popcount
platform-specific stuff is already confusing, but I feel what the patch is
trying to do is worth it, because I am getting an extra 7-8% improvement
on my ARM machine.

> Second, there's no reason to put "nonasm" in the header as a static
> inline, and then call from there to the new now-global function "slow".

I explained above why I shifted the nonasm code into the header and made
it inline.

> Especially since the supposed static inline is still needed as a possible
> value as a function pointer on x86, so the compiler is not going to
> inline it on x86 anyway. That just confuses things.

Yeah, the inline is just a request to the compiler anyway, right? On x86,
pg_bitutils.c will have it as a non-inline function, and all the other
files would have it as an inline function which will never be used.
Like I mentioned, it is important to define it as inline in order to avoid
a function call when one calls pg_popcount64(); pg_popcount64() should be
translated directly into the built-in intrinsic.

> (I did make sure to remove indirect calls from the retail functions in
> [1], in case we want to go that route).

Yeah, I quickly had a look at that. I am still going over that thread.
Thanks for the exhaustive analysis there.