On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > >
> > > Can we try reducing the size of
> > > 8MB memory blocks? The comment atop allocation says: "XXX the
> > > allocation sizes used below pre-date generation context's block
> > > growing code. These values should likely be benchmarked and set to
> > > more suitable values.", so do we need some tuning here?
> >
> > Reducing the size of the 8MB memory block would be one solution and
> > could be better as it could be back-patchable. It would mitigate the
> > problem but would not resolve it. I agree to try reducing it and do
> > some benchmark tests. If it reasonably makes the problem less likely
> > to happen, it would be a good solution.
> >
>
> makes sense.
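For reference, the "8kb-mem-block" variant below is essentially a one-line
tweak to the GenerationContextCreate() call that the XXX comment above is
attached to in ReorderBufferAllocate(). A minimal sketch, assuming we simply
pass SLAB_DEFAULT_BLOCK_SIZE (8kB) instead of SLAB_LARGE_BLOCK_SIZE (8MB)
for all three size arguments and leave finer tuning of min/init/max block
sizes for later:

    /*
     * Sketch: use 8kB generation blocks for decoded tuples instead of
     * 8MB, so that a few surviving tuples keep at most an 8kB block
     * allocated each, rather than an entire 8MB block.
     */
    buffer->tup_context = GenerationContextCreate(new_ctx,
                                                  "Tuples",
                                                  SLAB_DEFAULT_BLOCK_SIZE,
                                                  SLAB_DEFAULT_BLOCK_SIZE,
                                                  SLAB_DEFAULT_BLOCK_SIZE);

That bounded per-block waste is what shows up in the memory numbers of
test case #1 below.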
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems promising; it mitigates the problem while keeping
comparable performance.

Here are the three code bases I used:

* head: current head code.
* per-tx-bump: the proposed idea (with a slight change; each sub- and
  top-level transaction has its own bump memory context to store decoded
  tuples).
* 8kb-mem-block: same as head except for changing the generation memory
  context block size from 8MB to 8kB.

And here are the test cases and results:

1. Memory usage check

I've run the test that I shared before and checked the maximum amount of
memory allocated in the reorderbuffer context, as shown by
MemoryContextMemAllocated(). Here are the results:

head: 2.1GB (while rb->size shows 43MB)
per-tx-bump: 50MB (while rb->size shows 43MB)
8kb-mem-block: 54MB (while rb->size shows 43MB)

I've confirmed that the excessive memory usage issue didn't happen in the
per-tx-bump and 8kb-mem-block cases.

2. Decoding many subtransactions

IIUC this kind of workload was a trigger to make us invent the Generation
Context for logical decoding[1]. A single top-level transaction has 1M
subtransactions, each of which inserts a tuple. Here are the results:

head: 31694.163 ms (00:31.694)
per-tx-bump: 32661.752 ms (00:32.662)
8kb-mem-block: 31834.872 ms (00:31.835)

head and 8kb-mem-block showed similar results, whereas per-tx-bump shows a
bit of regression. I think this is because of the overhead of creating and
deleting a memory context for each subtransaction.

3. Decoding a big transaction

The next test case decodes a single big transaction that inserts 10M rows.
I set logical_decoding_work_mem large enough to avoid spilling. Here are
the results:

head: 19859.113 ms (00:19.859)
per-tx-bump: 19422.308 ms (00:19.422)
8kb-mem-block: 19923.600 ms (00:19.924)

There were no big differences. FYI, I also checked the maximum memory
usage for this test case:

head: 1.53GB
per-tx-bump: 1.4GB
8kb-mem-block: 1.53GB

per-tx-bump used slightly less memory, probably thanks to the bump memory
contexts.

4. Decoding many short transactions

The last test case decodes a bunch of short pgbench transactions (10k
transactions). Here are the results:

head: 31694.163 ms (00:31.694)
per-tx-bump: 32661.752 ms (00:32.662)
8kb-mem-block: 31834.872 ms (00:31.835)

I can see a trend similar to test case #2 above.

Overall, reducing the generation context memory block size to 8kB seems
promising, and using a bump memory context per transaction didn't bring as
much performance improvement as I expected in these cases.

Regards,

[1] https://www.postgresql.org/message-id/flat/20160706185502.1426.28...@wrigleys.postgresql.org

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com