I wrote:
> However, I then tried a partitioned equivalent of the 6-column case
> (script also attached), and it looks like
> 6 columns 16551 19097 15637 18201
> which is really noticeably worse, 16% or so.

... and on the third hand, that might just be some weird compiler- and platform-specific artifact.

Using the exact same compiler (RHEL8's gcc 8.3.1) on a different x86_64 machine, I measure the same case as about 7% slowdown not 16%. That's still not great, but it calls the original measurement into question, for sure.

Using Apple's clang 12.0.0 on an M1 mini, the patch actually clocks in a couple percent *faster* than HEAD, for both the partitioned and unpartitioned 6-column test cases.

So I'm not sure what to make of these results, but my level of concern is less than it was earlier today. I might've just gotten trapped by the usual bugaboo of micro-benchmarking, ie putting too much stock in only one test case.

regards, tom lane