Right, good catch, Enrico. The issue (#1063) description says: 

> PendingAddOp:maybeRecycle()->recycle() keeps the buffer until writeComplete()
> is called for each bookie write. We need to keep this buffer only until it is
> successfully transferred by netty. In the current code, the write is retried
> only if disableEnsembleChangeFeature is enabled. Otherwise, there is no point
> in keeping this buffer around.
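
If I read the issue right, the two buffer lifetimes being contrasted are
roughly these (a hypothetical sketch with made-up names, not the actual
PendingAddOp code):

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.Channel;
    import io.netty.util.ReferenceCountUtil;

    /** The two lifetimes: release at last writeComplete() vs. on transfer. */
    class BufferLifetimeSketch {
        private final ByteBuf toSend;
        private int outstandingWrites; // one per bookie in the write set

        BufferLifetimeSketch(ByteBuf toSend, int writeQuorum) {
            this.toSend = toSend;
            this.outstandingWrites = writeQuorum;
        }

        // Current behaviour: keep toSend until writeComplete() has fired
        // for every bookie, so a failed write can still be retried.
        synchronized void writeComplete() {
            if (--outstandingWrites == 0) {
                ReferenceCountUtil.release(toSend);
            }
        }

        // Proposed behaviour: drop our reference once netty has finished
        // writing the bytes (the future also completes on failure); after
        // that the write cannot be retried from this buffer.
        void releaseOnTransfer(Channel channel) {
            channel.writeAndFlush(toSend.retainedDuplicate())
                   .addListener(f -> ReferenceCountUtil.release(toSend));
        }
    }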

JV, the author of the PR, also says the following to Sijie:

> toSend buffer is not needed for retries as we discussed on slack.


I don't know what the reason is. JV, Sijie, it has been a while, but
perhaps you guys can elaborate? Specifically, I'm trying to understand what
guarantee BK currently offers for a configuration in which WQ > AQ. I'd
think that we guarantee that an acknowledged entry is eventually written WQ
ways and that it is observable by readers once the ledger is closed.

-Flavio

> On 14 Jan 2021, at 18:34, Enrico Olivelli <eolive...@gmail.com> wrote:
> 
> Flavio
> 
> On Thu, 14 Jan 2021 at 17:56, Flavio Junqueira <f...@apache.org> wrote:
> 
>> Using your example, the PendingAddOp should remain active until there are
>> 3 copies of the add entry. The client can ack back once it receives two
>> positive acks from bookies, but it shouldn't declare the add entry done at
>> that point.
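>> 
>> In code terms, what I have in mind is roughly this (a sketch with
>> made-up names, not the real PendingAddOp):
>> 
>>     /** Ack the client at AQ, but retire the op only at WQ. */
>>     class QuorumTrackingSketch {
>>         private final int ackQuorum = 2;   // AQ: when the caller is acked
>>         private final int writeQuorum = 3; // WQ: when the op may be retired
>>         private int positiveAcks;
>> 
>>         synchronized void bookieWriteSucceeded(Runnable ackClient,
>>                                                Runnable retireOp) {
>>             positiveAcks++;
>>             if (positiveAcks == ackQuorum) {
>>                 ackClient.run();   // caller sees success here...
>>             }
>>             if (positiveAcks == writeQuorum) {
>>                 retireOp.run();    // ...but buffers must live until here
>>             }
>>         }
>>     }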
>> 
> 
> Probably this behaviour has been broken by this commit
> https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c
> 
> My understanding is that as soon as we reach AQ we are discarding the
> "toSend" buffer and we cannot retry the write anymore
> 
> Enrico
> 
> 
>> 
>> There is the case that the third bookie is slow, but it could have failed
>> altogether, in which case the entry needs to be replicated to a new bookie.
>> 
>> -Flavio
>> 
>> On 13 Jan 2021, at 17:28, Enrico Olivelli <eolive...@gmail.com> wrote:
>> 
>> 
>> 
>> On Wed, 13 Jan 2021 at 17:05, Flavio Junqueira <f...@apache.org> wrote:
>> 
>>> We should work on some kind of back-pressure mechanism for the client,
>>> but I am not sure which kind of support we should provide at the BK level.
>>> 
>>> 
>>> Is there an issue for this? If there isn't, then perhaps we can start
>>> that way.
>>> 
>>> And as soon as the application is notified of the result of the write
>>> (success or failure), we release the reference to the payload (as I have
>>> shown in this email thread), so in theory the application has full
>>> control over the retained memory and can apply its own memory-management
>>> mechanisms.
>>> 
>>> 
>>> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies
>>> reply, then it is possible that the entry is not going to be written to
>>> |WQ| - |AQ| bookies because the entry data might have been reclaimed by the
>>> application? The contract as I understand it is that an entry is to be
>>> replicated |WQ| ways, even though the application is willing to receive a
>>> confirmation after |AQ| bookie responses.
>>> 
>>> What am I missing?
>>> 
>> 
>> If I am reading the PendingAddOp code correctly, we currently do it this
>> way; say we run with 3-3-2:
>> - enqueue the write request to the 3 PerChannelBookieClients
>> - as soon as we receive 2 confirmations, trigger the callback and
>> discard the payload
>> 
>> so if the first 2 confirmations arrived before we wrote to the socket of
>> the third bookie (enqueued the payload on the Netty channel, actually), we
>> would not send the entry to the 3rd bookie at all.
>> This should not happen, because we serialize the operations per ledger (by
>> sticking them to one thread), so you cannot process the incoming acks from
>> the first two bookies while executing the PendingAddOp write loop.
>> So we give every bookie a chance to receive the entry, as long as it is in
>> good health (bookie, network...)
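>> 
>> To picture the ordering (a simplified sketch with made-up names, using a
>> plain JDK single-thread executor rather than the real per-ledger ordered
>> executor):
>> 
>>     import java.util.List;
>>     import java.util.concurrent.ExecutorService;
>>     import java.util.concurrent.Executors;
>> 
>>     /** One thread per ledger: sends and acks cannot interleave. */
>>     class PerLedgerOrderingSketch {
>>         private final ExecutorService ledgerThread =
>>                 Executors.newSingleThreadExecutor();
>> 
>>         void addEntry(List<Runnable> sendToEachBookie) {
>>             // the whole write loop runs as one task, so all sends are
>>             // enqueued before any ack callback for this ledger runs
>>             ledgerThread.execute(() -> sendToEachBookie.forEach(Runnable::run));
>>         }
>> 
>>         void onBookieAck(Runnable completeIfQuorumReached) {
>>             // acks are funnelled through the same thread, so they are
>>             // strictly ordered after the send loop above
>>             ledgerThread.execute(completeIfQuorumReached);
>>         }
>>     }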
>> 
>> Enrico
>> 
>> 
>> 
>>> 
>>> -Flavio
>>> 
>>> On 13 Jan 2021, at 11:30, Enrico Olivelli <eolive...@gmail.com> wrote:
>>> 
>>> Flavio
>>> 
>>> On Tue, 12 Jan 2021 at 17:26, Flavio Junqueira <f...@apache.org> wrote:
>>> 
>>>> I have observed the issue that Matteo describes, and I also attributed
>>>> the problem to the absence of a back-pressure mechanism in the client.
>>>> Issue #2497 was not about that, though: there was some corruption going
>>>> on that led to the server receiving garbage.
>>>> 
>>> 
>>> Correct, #2497 is not about the topic of this email; I just mentioned it
>>> because the discussion started from that comment from Matteo.
>>> 
>>> We should work on some kind of back-pressure mechanism for the client,
>>> but I am not sure which kind of support we should provide at the BK level.
>>> 
>>> Regarding the writer side of this story and memory usage:
>>> in the case of a ByteBuf, we do not copy the original payload that the
>>> caller passes in; see PendingAddOp
>>> 
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>>> 
>>> and here, where we simply wrap it in a ByteBufList:
>>> 
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>>> 
>>> And as soon as the application is notified of the result of the write
>>> (success or failure), we release the reference to the payload (as I have
>>> shown in this email thread), so in theory the application has full
>>> control over the retained memory and can apply its own memory-management
>>> mechanisms.
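>>> 
>>> In effect the write path does something like this (a simplified sketch
>>> with made-up names, using plain Netty composites rather than the actual
>>> ByteBufList code):
>>> 
>>>     import io.netty.buffer.ByteBuf;
>>>     import io.netty.buffer.ByteBufAllocator;
>>>     import io.netty.buffer.CompositeByteBuf;
>>>     import io.netty.util.ReferenceCountUtil;
>>> 
>>>     /** Header + caller payload combined without copying a byte. */
>>>     class NoCopySendSketch {
>>> 
>>>         ByteBuf wrapForSending(ByteBuf header, ByteBuf callerPayload) {
>>>             // the composite only holds references to both buffers
>>>             CompositeByteBuf frame =
>>>                     ByteBufAllocator.DEFAULT.compositeBuffer(2);
>>>             frame.addComponents(true, header, callerPayload);
>>>             return frame;
>>>         }
>>> 
>>>         void onAddComplete(ByteBuf frame) {
>>>             // releasing the composite releases its components too,
>>>             // handing memory control back to the application
>>>             ReferenceCountUtil.release(frame);
>>>         }
>>>     }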
>>> 
>>> 
>>> Enrico
>>> 
>>> 
>>>> -Flavio
>>>> 
>>>>> On 8 Jan 2021, at 22:47, Matteo Merli <mme...@apache.org> wrote:
>>>>> 
>>>>> On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eolive...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> Hi Matteo,
>>>>>> in this comment you are talking about an issue you saw when WQ is
>>>>>> greater than AQ
>>>>>> 
>>>>>> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>>>>>> 
>>>>>> IIUC you are saying that if one bookie is slow the client continues
>>>>>> to accumulate references to the entries that still have not received
>>>>>> confirmation from it.
>>>>>> I think that this is correct.
>>>>>> 
>>>>>> Have you seen problems in production related to this scenario ?
>>>>>> Can you tell more about them ?
>>>>> 
>>>>> Yes, for simplicity, assume e=3, w=3, a=2.
>>>>> 
>>>>> If one bookie is slow (not down, just slow), the BK client will ack
>>>>> to the user that the entries are written after the first 2 acks.
>>>>> In the meantime, it will keep waiting for the 3rd bookie to respond.
>>>>> If the bookie responds within the timeout, the entries can then be
>>>>> dropped from memory; otherwise the write will time out internally and
>>>>> will be replayed to a new bookie.
>>>>> 
>>>>> In both cases, the amount of memory used in the client will max out at
>>>>> "throughput" * "timeout". This can be a large amount of memory and can
>>>>> easily cause OOM errors.
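>>>>> For a concrete feel: at 100 MB/s of adds with a 30-second add timeout,
>>>>> that bound works out to roughly 3 GB of retained entries.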
>>>>> 
>>>>> Part of the problem is that it cannot be solved from outside the BK
>>>>> client, since there is no visibility into which entries have 2 or 3
>>>>> acks, and therefore it is not possible to apply backpressure there.
>>>>> Instead, there should be a backpressure mechanism in the BK client
>>>>> itself to prevent this kind of issue.
>>>>> One possibility could be to use the same approach as described in
>>>>> https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
>>>>> setting a max memory limit per BK client instance and throttling
>>>>> everything after the quota is reached.
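>>>>> 
>>>>> The core of such a limit could look roughly like this (a hypothetical
>>>>> sketch, not the actual PIP-74 implementation):
>>>>> 
>>>>>     import java.util.concurrent.Semaphore;
>>>>> 
>>>>>     /** Blocks writers once outstanding entry bytes hit a quota. */
>>>>>     class MemoryLimitSketch {
>>>>>         private final Semaphore memory;
>>>>> 
>>>>>         MemoryLimitSketch(int maxOutstandingBytes) {
>>>>>             this.memory = new Semaphore(maxOutstandingBytes);
>>>>>         }
>>>>> 
>>>>>         void beforeAddEntry(int entrySize) throws InterruptedException {
>>>>>             // throttles the caller once the quota is exhausted
>>>>>             memory.acquire(entrySize);
>>>>>         }
>>>>> 
>>>>>         void afterLastBookieWrite(int entrySize) {
>>>>>             // permits come back only when the last bookie write
>>>>>             // finishes, not at ack-quorum time
>>>>>             memory.release(entrySize);
>>>>>         }
>>>>>     }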
>>>>> 
>>>>> 
>>>>> Matteo
>>> 
>>> 
>> 
