Thanks for looking at this. On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthali...@gmail.com> wrote: > I've checked for Clang 6, it turns out that indeed it generates popcnt without > any macro, but only in one place for bloom_prop_bits_set. After looking at > this > function it seems that it would be benefitial to actually use popcnt there > too.
Yeah, that's the pattern that's mentioned in https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/ It would need to be changed to call the popcount function. This existing makes me a bit more worried that some extension could be using a similar pattern and end up being compiled with -mpopcnt due to pg_config having that CFLAG. That's all fine until the binary makes it's way over to a machine without that instruction. > > I am able to measure performance gains from the patch. In a 3.4GB > > table containing a single column with just 10 statistics targets, I > > got the following times after running ANALYZE on the table. > > I've tested it too a bit, and got similar results when the patched version is > slightly faster. But then I wonder if popcnt is the best solution here, since > after some short research I found a paper [1], where authors claim that: > > Maybe surprisingly, we show that a vectorized approach using SIMD > instructions can be twice as fast as using the dedicated instructions on > recent Intel processors. > > > [1]: https://arxiv.org/pdf/1611.07612.pdf I can't imagine that using the number_of_ones[] array processing 8-bits at a time would be slower than POPCNT though. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services