On Mon, Feb 26, 2024 at 6:43 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
>
> On 2/26/24 07:46, Masahiko Sawada wrote:
> > On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra
> > <tomas.von...@enterprisedb.com> wrote:
> >> ...
> >>
> >> overall design
> >> --------------
> >>
> >> As for the design, I agree with the approach of using a binaryheap to
> >> track transactions by size. When going over the thread history and the
> >> initial approach of only keeping "large" transactions above some
> >> threshold (e.g. 10%), I was really concerned that it would either lead
> >> to abrupt changes in behavior (when transactions move just around the
> >> 10%), or wouldn't help with many common cases (with most transactions
> >> being below the limit).
> >>
> >> I was going to suggest some sort of "binning" - keeping lists for
> >> transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
> >> evicting transactions from a list, i.e. based on approximate size. But
> >> if the indexed binary heap is cheap enough, I think it's a better
> >> solution.
> >
> > I've also considered the binning idea, but it was not clear to me how
> > well it works when all transactions belong to the same class. For
> > example, if we need to free up 100kB of memory, we could end up
> > evicting 2000 transactions consuming 50 bytes each instead of 100
> > transactions consuming 1000 bytes each, so we would end up serializing
> > more transactions. Also, I'm concerned about the cost of maintaining
> > the binning lists.
>
> I don't think the list maintenance would be very costly - in particular,
> the lists would not need to be sorted by size. You're right that in some
> extreme cases we might evict the smallest transactions in the list, but
> on average we'd evict transactions of average size, which seems OK for
> this use case.
>
> Anyway, I don't think we need to be distracted by this. I mentioned it
> merely to show it was considered, but the heap seems to work well
> enough, and in the end is even simpler because the complexity is hidden
> outside reorderbuffer.
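For reference, the core of the max-heap approach is simply to keep the
transactions ordered by total size, so that picking the eviction victim
becomes a peek at the heap's root. A rough sketch, assuming the patch
adds a txn_heap field to ReorderBuffer (the bodies here are illustrative,
not the patch's actual code):

/* requires "lib/binaryheap.h" and "replication/reorderbuffer.h" */

/* Order ReorderBufferTXN entries by size; binaryheap is a max-heap
 * with respect to the comparator, so the largest transaction ends up
 * at the root. */
static int
ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
{
    const ReorderBufferTXN *ta = (const ReorderBufferTXN *) DatumGetPointer(a);
    const ReorderBufferTXN *tb = (const ReorderBufferTXN *) DatumGetPointer(b);

    if (ta->size < tb->size)
        return -1;
    if (ta->size > tb->size)
        return 1;
    return 0;
}

/* Choosing the eviction victim is then an O(1) peek at the root. */
static ReorderBufferTXN *
ReorderBufferLargestTXN(ReorderBuffer *rb)
{
    return (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));
}

The "indexed" part (0002) matters because a transaction's size changes
as changes are added or evicted, so we need to sift an arbitrary entry
up or down without rebuilding the whole heap.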
> >> The one thing I'm a bit concerned about is the threshold used to
> >> start using the binary heap - thresholds with binary decisions may
> >> easily lead to a "cliff" and robustness issues, i.e. an abrupt change
> >> in behavior with a significant runtime change (e.g. you add/remove
> >> one transaction and the code takes a much more expensive path). The
> >> value (1024) seems rather arbitrary; I wonder if there's something to
> >> justify that choice.
> >
> > True. 1024 seems small to me. In my environment, I started to see a
> > big difference from around 40000 transactions, but it varies depending
> > on environments and workloads.
> >
> > I think that the performance problem we're addressing doesn't
> > normally happen as long as all transactions being decoded are
> > top-level transactions; otherwise, we'd also need to improve
> > ReorderBufferLargestStreamableTopTXN(). Given this fact, I think
> > max_connections = 1024 is a possible value on some systems, and I've
> > observed such systems sometimes. On the other hand, I've observed
> > more than 5000 in just a few cases, and having more than 5000
> > transactions in the ReorderBuffer seems unlikely to happen without
> > subtransactions. I think we can say it's an extreme case, although
> > the number is still arbitrary.
> >
> > Or perhaps we can compute the threshold based on max_connections,
> > e.g., max_connections * 10. That way, we can ensure that users won't
> > incur the max-heap maintenance costs as long as they don't use
> > subtransactions.
>
> Tying this to max_connections seems like an interesting option. It'd
> make this adaptive to the system. I haven't thought about the exact
> value (m_c * 10), but it seems better than arbitrary hard-coded values.

I've updated the patch accordingly, using MaxConnections for now. I've
also updated some comments and commit messages, and added typedefs.list
changes.
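To be concrete, what I have in mind amounts to something like the
following (the macro name, multiplier, and transaction counter are
illustrative, not necessarily what ends up in the patch):

/* Track transactions in the max-heap only once the reorder buffer
 * holds this many (sub)transactions; below that, a linear scan to
 * find the largest transaction is cheap enough.  Scaling by
 * MaxConnections means workloads without subtransactions should
 * never cross the threshold. */
#define REORDER_BUFFER_MEM_TRACK_THRESHOLD  (MaxConnections * 10)

static inline bool
ReorderBufferMemTrackNeeded(ReorderBuffer *rb)
{
    /* txn_count is assumed to count all transactions and
     * subtransactions known to the reorder buffer. */
    return rb->txn_count >= REORDER_BUFFER_MEM_TRACK_THRESHOLD;
}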
> >> In any case, I agree it'd be good to have some dampening factor, to
> >> reduce the risk of thrashing because of adding/removing a single
> >> transaction to the decoding.
> >>
> >>
> >> related stuff / GenerationContext
> >> ---------------------------------
> >>
> >> It's not the fault of this patch, but this reminds me I have some
> >> doubts about how the eviction interferes with using the
> >> GenerationContext for some of the data. I suspect we can easily get
> >> into a situation where we evict the largest transaction, but that
> >> doesn't actually reduce the memory usage at all, because the memory
> >> context blocks are shared with some other transactions and don't get
> >> 100% empty (so we can't release them). But it's actually worse,
> >> because GenerationContext does not even reuse this memory. So do we
> >> even gain anything by the eviction?
> >>
> >> When the earlier patch versions also considered the age of the
> >> transaction, to try evicting the older ones first, I think that was
> >> interesting. I think we may want to do something like this even with
> >> the binary heap.
> >
> > Thank you for raising this issue. This is one of the highest priority
> > items in my backlog. We've seen cases where logical decoding uses much
> > more memory than the logical_decoding_work_mem value [1][2] (e.g. it
> > used 4GB of memory even though logical_decoding_work_mem was 256kB).
> > I think that problem would still happen even with this improvement to
> > the eviction.
> >
> > I believe these are separate problems we can address, and evicting
> > large transactions first would still be the right strategy. We might
> > want to improve how we store changes in memory contexts. For example,
> > it might be worth having a per-transaction memory context so that the
> > eviction can actually free memory blocks. We can discuss it in a
> > separate thread.
>
> Yes, I think using a per-transaction context for large transactions
> might work. I don't think we want too many contexts, so we'd start with
> the shared context, and then at some point (when the transaction
> exceeds say 5% of the memory limit) we'd switch it to a separate one.
>
> But that's a matter for a separate patch, so let's discuss elsewhere.

+1
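For the archives, a minimal sketch of what that might look like, using
the 5% trigger suggested above; the txn->context field and the function
are hypothetical, nothing like this exists yet:

/* requires "utils/memutils.h" and "replication/reorderbuffer.h" */

/* Once a transaction exceeds a fraction of logical_decoding_work_mem,
 * give it its own generation context, so that evicting it can release
 * whole blocks via MemoryContextDelete(). */
#define TXN_OWN_CONTEXT_FRACTION    0.05    /* 5% of the memory limit */

static void
ReorderBufferMaybeSwitchContext(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    if (txn->context == NULL &&
        txn->size >= (Size) (logical_decoding_work_mem * 1024L *
                             TXN_OWN_CONTEXT_FRACTION))
    {
        /* Changes added from now on would be allocated here. */
        txn->context = GenerationContextCreate(rb->context,
                                               "per-txn changes",
                                               SLAB_LARGE_BLOCK_SIZE,
                                               SLAB_LARGE_BLOCK_SIZE,
                                               SLAB_LARGE_BLOCK_SIZE);
    }
}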
> >> For example, a system may be doing a lot of eviction / spilling with
> >> logical_decoding_work_mem=64MB, but setting 128MB may completely
> >> eliminate that. Of course, if there are large transactions, this may
> >> not be possible (the GUC would have to exceed RAM). But I don't think
> >> that's very common - the incidents I've observed were often resolved
> >> by bumping logical_decoding_work_mem a little bit.
> >>
> >> I wonder if there's something we might do to help users tune this. We
> >> should be able to measure the "peak" memory usage (how much memory
> >> we'd need to not spill), so maybe we could log that as a WARNING,
> >> similarly to checkpoints - there we only log "checkpoints too
> >> frequent, tune WAL limits", but perhaps we might do more here? Or
> >> maybe we could add the watermark to the system catalog?
> >
> > Interesting ideas.
> >
> > Statistics such as spill_count shown in the pg_stat_replication_slots
> > view could already give users hints to increase
> > logical_decoding_work_mem. In addition to that, it's an interesting
> > idea to have the high water mark in the view.
>
> The spill statistics are useful, but I'm not sure they can answer the
> main question:
>
>   How high would the memory limit need to be to not spill?
>
> Maybe there's something we can measure / log to help with this.

Right. I like the idea of the high watermark, and the
pg_stat_replication_slots view would be the place to store such
information. Since the reorder buffer evicts or streams transactions
based on logical_decoding_work_mem anyway, we'd probably need to compute
the maximum amount of data that would have been in the reorder buffer at
one point in time, assuming no transactions had been evicted or
streamed.
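Something along these lines might be enough; the fields and the function
are invented for illustration:

/* Maintain a "virtual" reorder buffer size that is adjusted as changes
 * are added and as transactions finish normally, but NOT when they are
 * spilled or streamed, and remember its peak.  The peak is what
 * pg_stat_replication_slots would report. */
static void
ReorderBufferUpdatePeakSize(ReorderBuffer *rb, Size sz, bool addition)
{
    if (addition)
    {
        rb->virtual_size += sz;
        if (rb->virtual_size > rb->peak_size)
            rb->peak_size = rb->virtual_size;
    }
    else
        rb->virtual_size -= sz;     /* on commit/abort cleanup only */
}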
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments (binary):
  v7-0001-Make-binaryheap-enlargeable.patch
  v7-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch
  v7-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch