Amit Langote <amitlangot...@gmail.com> writes: > On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu <yoshikazu_i...@live.jp> > wrote: >> Certainly, using bitmapset contributes to the performance when scanning >> one partition(few partitions) from large partitions.
> Thanks Imai-san for testing. I tried to replicate these numbers with the code as-committed, and could not. What I get, using the same table-creation code as you posted and a pgbench script file like \set param random(1, :N) select * from rt where a = :param; is scaling like this: N tps, range tps, hash 2 10520.519932 10415.230400 8 10443.361457 10480.987665 32 10341.196768 10462.551167 128 10370.953849 10383.885128 512 10207.578413 10214.049394 1024 10042.794340 10121.683993 4096 8937.561825 9214.993778 8192 8247.614040 8486.728918 If I use "-M prepared" the numbers go up a bit for lower N, but drop at high N: N tps, range tps, hash 2 11449.920527 11462.253871 8 11530.513146 11470.812476 32 11372.412999 11450.213753 128 11289.351596 11322.698856 512 11095.428451 11200.683771 1024 10757.646108 10805.052480 4096 8689.165875 8930.690887 8192 7301.609147 7502.806455 Digging into that, it seems like the degradation with -M prepared is mostly in LockReleaseAll's hash_seq_search over the locallock hash table. What I think must be happening is that with -M prepared, at some point the plancache decides to try a generic plan, which causes opening/locking all the partitions, resulting in permanent bloat in the locallock hash table. We immediately go back to using custom plans, but hash_seq_search has more buckets to look through for the remainder of the process' lifetime. I do see some cycles getting spent in apply_scanjoin_target_to_paths that look to be due to scanning over the long part_rels array, which your proposal would ameliorate. But (a) that's pretty small compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths is a remarkable display of brute force inefficiency to begin with. I think we should see if we can't nuke that function altogether in favor of generating the paths with the right target the first time. BTW, the real elephant in the room is the O(N^2) cost of creating these tables in the first place. The runtime for the table-creation scripts looks like N range hash 2 0m0.011s 0m0.011s 8 0m0.015s 0m0.014s 32 0m0.032s 0m0.030s 128 0m0.132s 0m0.099s 512 0m0.969s 0m0.524s 1024 0m3.306s 0m1.442s 4096 0m46.058s 0m15.522s 8192 3m11.995s 0m58.720s This seems to be down to the expense of doing RelationBuildPartitionDesc to rebuild the parent's relcache entry for each child CREATE TABLE. Not sure we can avoid that, but maybe we should consider adopting a cheaper-to-read representation of partition descriptors. The fact that range-style entries seem to be 3X more expensive to load than hash-style entries is strange. regards, tom lane