Right, good catch, Enrico. The issue (#1063) description says:

> PendingAddOp:maybeRecycle()->recycle() keeps the buffer until writeComplete()
> is called for each bookie write. We need to keep this buffer only until it is
> successfully transferred by netty. In the current code, the write is retried
> only if disableEnsembleChangeFeature is enabled. Otherwise, there is no point
> in keeping this buffer around.
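To illustrate the point of the issue, here is a rough sketch using plain netty calls (this is not the actual PendingAddOp code, just the idea): the serialized entry only needs to stay referenced until netty has handed it to each bookie's socket.

import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;

// Hypothetical sketch, not BookKeeper code.
final class EntrySender {

    void sendToBookie(Channel bookieChannel, ByteBuf toSend) {
        // retainedDuplicate() adds one reference that netty releases by itself
        // once the bytes have been handed to the socket, so the caller's own
        // reference to toSend can be dropped as soon as all per-bookie writes
        // have been enqueued, instead of being held until writeComplete()
        // arrives from every bookie.
        ChannelFuture transfer = bookieChannel.writeAndFlush(toSend.retainedDuplicate());
        transfer.addListener((ChannelFutureListener) f -> {
            if (!f.isSuccess()) {
                // The transfer itself failed: the timeout / ensemble-change path
                // takes over. The buffer is only needed again if the client has
                // to retry the same write (e.g. disableEnsembleChangeFeature).
            }
        });
    }
}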
JV, the author of the PR, also says the following to Sijie:

> toSend buffer is not needed for retries as we discussed on slack.

I don't know what the reason is. JV, Sijie, it has been a while, but perhaps
you guys can elaborate?

Specifically, I'm trying to understand what guarantee BK is currently offering
for a configuration in which WQ > AQ. I'd think that we guarantee that an entry
that is acknowledged is eventually written WQ ways and that it is observable by
readers when the ledger is closed.

-Flavio

> On 14 Jan 2021, at 18:34, Enrico Olivelli <eolive...@gmail.com> wrote:
>
> Flavio
>
> On Thu, 14 Jan 2021 at 17:56, Flavio Junqueira <f...@apache.org> wrote:
>
>> Using your example, the PendingAddOp should remain active until there are
>> 3 copies of the add entry. The client can ack back once it receives two
>> positive acks from bookies, but it shouldn't declare the add entry done at
>> that point.
>>
>
> Probably this behaviour has been broken by this commit
> https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c
>
> My understanding is that as soon as we reach AQ we are discarding the
> "toSend" buffer and we cannot retry the write anymore
>
> Enrico
>
>> There is the case that the third bookie is slow, but it could have failed
>> altogether, in which case the entry needs to be replicated to a new bookie.
>>
>> -Flavio
>>
>> On 13 Jan 2021, at 17:28, Enrico Olivelli <eolive...@gmail.com> wrote:
>>
>> On Wed, 13 Jan 2021 at 17:05, Flavio Junqueira <f...@apache.org> wrote:
>>
>>> We should work on some kind of back-pressure mechanism for the client,
>>> but I am not sure about which kind of support we should provide at BK level
>>>
>>> Is there an issue for this? If there isn't, then perhaps we can start
>>> that way.
>>>
>>> And as soon as the application is notified of the result of the write
>>> (success or failure) we are releasing the reference to the payload (as I
>>> have shown in this email thread), so in theory the application has full
>>> control over the retained memory and it can apply its own memory
>>> management mechanisms
>>>
>>> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies
>>> reply, then it is possible that the entry is not going to be written to
>>> |WQ| - |AQ| bookies because the entry data might have been reclaimed by the
>>> application? The contract as I understand it is that an entry is to be
>>> replicated |WQ| ways, even though the application is willing to receive a
>>> confirmation after |AQ| bookie responses.
>>>
>>> What am I missing?
>>>
>>
>> If I am not wrong in reading the PendingAddOp code, currently we do it this
>> way, say we run with 3-3-2:
>> - enqueue the write request to the 3 PerChannelBookieClients
>> - as soon as we receive 2 confirmations we trigger the callback and
>>   discard the payload
>>
>> So if the first 2 confirmations arrive before we write to the socket
>> (actually, enqueue the payload on the Netty channel) of the third bookie,
>> we are not sending the entry to the 3rd bookie at all.
>> This should not happen because we serialize the operations per-ledger (by
>> sticking them to one thread), so you cannot process the incoming acks from
>> the first two bookies while executing the PendingAddOp write loop.
>> So we are giving a chance to every bookie to receive the entry, if it is
>> in good health (bookie, network...)
>>
>> Enrico
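For reference, a much-simplified model of the flow described above (this is not the real PendingAddOp; BookieWriter stands in for PerChannelBookieClient, and the single-threaded executor stands in for the per-ledger ordered executor):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntConsumer;

// Simplified model, not BookKeeper code.
final class AddEntryModel {

    interface BookieWriter {
        // Stand-in for PerChannelBookieClient.addEntry(...).
        void enqueueWrite(byte[] payload, IntConsumer writeComplete);
    }

    private final ExecutorService ledgerThread = Executors.newSingleThreadExecutor();
    private final int ackQuorum = 2;   // AQ
    private int acks = 0;              // only touched on ledgerThread
    private byte[] toSend;             // stands in for the retained payload buffer

    void addEntry(byte[] payload, List<BookieWriter> writeSet) {
        ledgerThread.execute(() -> {
            toSend = payload;
            // Step 1: the write loop enqueues the payload to all WQ bookies first.
            for (BookieWriter bookie : writeSet) {
                bookie.enqueueWrite(payload, this::writeComplete);
            }
            // Step 2: acks are dispatched onto this same single thread, so none
            // of them can be processed before the loop above has finished; every
            // bookie gets a chance to receive the entry.
        });
    }

    void writeComplete(int rc) {
        ledgerThread.execute(() -> {
            if (rc == 0 && ++acks == ackQuorum && toSend != null) {
                // The user callback fires here and the payload reference is
                // dropped, even though the 3rd bookie may not have acked yet.
                toSend = null;
            }
        });
    }
}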
>>> -Flavio
>>>
>>> On 13 Jan 2021, at 11:30, Enrico Olivelli <eolive...@gmail.com> wrote:
>>>
>>> Flavio
>>>
>>> On Tue, 12 Jan 2021 at 17:26, Flavio Junqueira <f...@apache.org> wrote:
>>>
>>>> I have observed the issue that Matteo describes and I also attributed
>>>> the problem to the absence of a back-pressure mechanism in the client.
>>>> Issue #2497 was not about that, though. There was some corruption going on
>>>> that was leading to the server receiving garbage.
>>>>
>>>
>>> Correct, #2497 is not about the topic of this email, I just mentioned it
>>> because the discussion started from that comment from Matteo.
>>>
>>> We should work on some kind of back-pressure mechanism for the client,
>>> but I am not sure about which kind of support we should provide at BK level
>>>
>>> Regarding the writer side of this story and memory usage,
>>> we are not performing copies of the original payload that the caller is
>>> passing, in the case of a ByteBuf: see PendingAddOp
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>>> and here, where we simply wrap it in a ByteBufList
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>>>
>>> And as soon as the application is notified of the result of the write
>>> (success or failure) we are releasing the reference to the payload (as I
>>> have shown in this email thread), so in theory the application has full
>>> control over the retained memory and it can apply its own memory
>>> management mechanisms
>>>
>>> Enrico
>>>
>>>
>>>> -Flavio
>>>>
>>>>> On 8 Jan 2021, at 22:47, Matteo Merli <mme...@apache.org> wrote:
>>>>>
>>>>> On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Matteo,
>>>>>> in this comment you are talking about an issue you saw when WQ is
>>>>>> greater than AQ:
>>>>>> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>>>>>>
>>>>>> IIUC you are saying that if one bookie is slow, the client continues
>>>>>> to accumulate references to the entries that still have not received the
>>>>>> confirmation from it.
>>>>>> I think that this is correct.
>>>>>>
>>>>>> Have you seen problems in production related to this scenario?
>>>>>> Can you tell more about them?
>>>>>
>>>>> Yes, for simplicity, assume e=3, w=3, a=2.
>>>>>
>>>>> If one bookie is slow (not down, just slow), the BK client will ack
>>>>> to the user that the entries are written after the first 2 acks.
>>>>> In the meantime, it will keep waiting for the 3rd bookie to respond.
>>>>> If the bookie responds within the timeout, the entries can now be
>>>>> dropped from memory, otherwise the write will time out internally and
>>>>> it will get replayed to a new bookie.
>>>>>
>>>>> In both cases, the amount of memory used in the client will max out at
>>>>> "throughput" * "timeout". This can be a large amount of memory and
>>>>> easily cause OOM errors.
>>>>>
>>>>> Part of the problem is that it cannot be solved from outside the BK
>>>>> client, since there's no visibility into which entries have 2 or 3 acks
>>>>> and therefore it's not possible to apply backpressure. Instead,
>>>>> there should be a backpressure mechanism in the BK client itself to
>>>>> prevent this kind of issue.
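To put the "throughput" * "timeout" bound quoted above into concrete numbers (the figures below are only an example, not BookKeeper defaults):

public class RetainedMemoryEstimate {
    public static void main(String[] args) {
        // Example numbers, not BookKeeper defaults.
        double throughputMBps = 200;      // sustained add throughput
        double addTimeoutSeconds = 30;    // client-side write timeout
        double worstCaseMB = throughputMBps * addTimeoutSeconds;
        // A single slow bookie can keep every entry from the last "timeout"
        // seconds pinned in the client: 200 MB/s * 30 s = 6000 MB.
        System.out.printf("Worst-case retained payload: %.0f MB%n", worstCaseMB);
    }
}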
>>>>> One possibility there could be to use the same approach as described in
>>>>> https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
>>>>> giving a max memory limit per BK client instance and throttling
>>>>> everything after the quota is reached.
>>>>>
>>>>> Matteo
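For completeness, a rough sketch of what such a per-client byte budget could look like (the MemoryLimiter below is hypothetical, not an existing BookKeeper or Pulsar API): addEntry callers would reserve the payload size before handing it to the PendingAddOp, and the budget would be returned only once the entry has been written WQ ways, has failed, or has timed out.

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a per-client memory budget, not an existing API.
final class MemoryLimiter {
    private final long maxBytes;
    private long usedBytes;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();

    MemoryLimiter(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Blocks the caller (backpressure) until the payload fits in the budget. */
    void reserve(long bytes) throws InterruptedException {
        lock.lock();
        try {
            while (usedBytes + bytes > maxBytes) {
                released.await();
            }
            usedBytes += bytes;
        } finally {
            lock.unlock();
        }
    }

    /** Called when the entry is written WQ ways, fails, or times out. */
    void release(long bytes) {
        lock.lock();
        try {
            usedBytes -= bytes;
            released.signalAll();
        } finally {
            lock.unlock();
        }
    }
}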