Using your example, the PendingAddOp should remain active until there are 3 
copies of the added entry. The client can acknowledge the write back to the 
application once it receives two positive acks from bookies, but it shouldn't 
declare the add entry done at that point. 

The third bookie may just be slow, but it could also have failed altogether, 
in which case the entry needs to be replicated to a new bookie.
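
To make the intended contract concrete, here is a rough sketch of the 
bookkeeping I have in mind (illustrative only, with made-up names, not the 
actual PendingAddOp code):

    // Sketch: track acks separately for "may notify the application" (AQ)
    // and "may retire the operation and release the payload" (WQ).
    class PendingAdd {
        final int writeQuorum; // WQ, e.g. 3
        final int ackQuorum;   // AQ, e.g. 2
        int positiveAcks = 0;
        boolean callbackTriggered = false;

        PendingAdd(int writeQuorum, int ackQuorum) {
            this.writeQuorum = writeQuorum;
            this.ackQuorum = ackQuorum;
        }

        void onBookieAck() {
            positiveAcks++;
            if (!callbackTriggered && positiveAcks >= ackQuorum) {
                callbackTriggered = true; // the application can be notified now
            }
            if (positiveAcks >= writeQuorum) {
                retire(); // only now is it safe to drop the entry payload
            }
        }

        void onBookieFailure() {
            // a failed (or timed-out) bookie means the entry must be written
            // to a replacement bookie before the op can be retired
        }

        void retire() { /* release the payload, remove from the pending map */ }
    }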

-Flavio 

> On 13 Jan 2021, at 17:28, Enrico Olivelli <eolive...@gmail.com> wrote:
> 
> 
> 
> On Wed, 13 Jan 2021 at 17:05, Flavio Junqueira <f...@apache.org> wrote:
>> We should work on some kind of back-pressure mechanism for the client, but I 
>> am not sure about which kind of support we should provide at the BK level.
> 
> Is there an issue for this? If there isn't, then perhaps we can start that 
> way.
> 
>> And as soon as the application is notified of the result of the write 
>> (success or failure) we release the reference to the payload (as I have 
>> shown in this email thread),
>> so in theory the application has full control over the retained memory and 
>> can apply its own memory management mechanisms.
> 
> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies 
> reply, then it is possible that the entry is not going to be written to |WQ| 
> - |AQ| bookies because the entry data might have been reclaimed by the 
> application? The contract as I understand it is that an entry is to be 
> replicated |WQ| ways, even though the application is willing to receive a 
> confirmation after |AQ| bookie responses.
> 
> What am I missing?
> 
> If I am not wrong in reading the PendingAddOp code, we currently do it this 
> way, say we run with 3-3-2 (ensemble 3, WQ 3, AQ 2):
> - enqueue the write request to the 3 PerChannelBookieClients
> - as soon as we receive 2 confirmations we trigger the callback and discard 
> the payload
> 
> so if the first 2 confirmations arrived before we wrote to the socket of the 
> third bookie (actually, before we enqueued the payload on its Netty channel), 
> we would not be sending the entry to the 3rd bookie at all.
> This should not happen, though, because we serialize the operations per ledger 
> (by sticking them to one thread), so you cannot process the incoming acks from 
> the first two bookies while the PendingAddOp write loop is executing.
> So we give every bookie a chance to receive the entry, as long as it is in 
> good health (bookie, network...).
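> 
> Just to sketch that per-ledger pinning (made-up names, not the actual client 
> code, which uses an OrderedExecutor keyed by the ledger id):
> 
>     import java.util.concurrent.Executor;
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
> 
>     // Sketch: every operation for a given ledger runs on the same single
>     // thread, so the PendingAddOp write loop (enqueue to all 3 bookies)
>     // cannot be interleaved with ack processing for that same ledger.
>     class PerLedgerExecutor {
>         private final ExecutorService[] threads;
> 
>         PerLedgerExecutor(int numThreads) {
>             threads = new ExecutorService[numThreads];
>             for (int i = 0; i < numThreads; i++) {
>                 threads[i] = Executors.newSingleThreadExecutor();
>             }
>         }
> 
>         Executor forLedger(long ledgerId) {
>             // ledger ids are non-negative, so a plain modulo is enough here
>             return threads[(int) (ledgerId % threads.length)];
>         }
>     }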
> 
> Enrico  
> 
>  
> 
> -Flavio  
> 
>> On 13 Jan 2021, at 11:30, Enrico Olivelli <eolive...@gmail.com> wrote:
>> 
>> Flavio
>> 
>> On Tue, 12 Jan 2021 at 17:26, Flavio Junqueira <f...@apache.org> wrote:
>> I have observed the issue that Matteo describes and I also attributed the 
>> problem to the absence of a back pressure mechanism in the client. Issue 
>> #2497 was not about that, though. There was some corruption going on that 
>> was leading to the server receiving garbage.
>> 
>> Correct, #2497 is not about the topic of this email, I just mentioned it 
>> because the discussion started from that comment from Matteo.
>> 
>> We should work on some kind of back-pressure mechanism for the client, but I 
>> am not sure about which kind of support we should provide at the BK level.
>> 
>> Regarding the writer side of this story and memory usage:
>> we are not making copies of the original payload that the caller passes in, 
>> in the case of a ByteBuf;
>> see PendingAddOp
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>> and here, we simply wrap it in a ByteBufList 
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
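>> 
>> For illustration only (plain Netty, not the actual DigestManager code): 
>> composing the protocol header and the caller's payload without copying can 
>> look like this, where the composite buffer references the payload's memory 
>> instead of duplicating it:
>> 
>>     import io.netty.buffer.ByteBuf;
>>     import io.netty.buffer.Unpooled;
>> 
>>     // The composite takes over the two references and shares their memory;
>>     // no entry bytes are copied, the caller's ByteBuf stays the only copy.
>>     static ByteBuf packageForSending(ByteBuf header, ByteBuf payload) {
>>         return Unpooled.wrappedBuffer(header, payload);
>>     }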
>> 
>> And as soon as the application is notified of the result of the write 
>> (success or failure) we release the reference to the payload (as I have 
>> shown in this email thread),
>> so in theory the application has full control over the retained memory
>> and can apply its own memory management mechanisms.
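>> 
>> Purely as a sketch of what I mean by that (assuming the ByteBuf flavour of 
>> asyncAddEntry; the quota counter is an application-side assumption, not a 
>> BookKeeper API):
>> 
>>     import io.netty.buffer.ByteBuf;
>>     import io.netty.buffer.Unpooled;
>>     import java.util.concurrent.atomic.AtomicLong;
>>     import org.apache.bookkeeper.client.LedgerHandle;
>> 
>>     // Sketch: the application accounts for in-flight payload bytes itself
>>     // and reclaims them when the add callback fires.
>>     class AccountedWriter {
>>         private final AtomicLong inFlightBytes = new AtomicLong();
>> 
>>         void write(LedgerHandle lh, byte[] data) {
>>             ByteBuf payload = Unpooled.wrappedBuffer(data);
>>             int len = payload.readableBytes();
>>             inFlightBytes.addAndGet(len);
>>             lh.asyncAddEntry(payload, (rc, handle, entryId, ctx) -> {
>>                 // success or failure, the client has dropped its reference
>>                 // by now, so the application can update its own quota here
>>                 inFlightBytes.addAndGet(-len);
>>             }, null);
>>         }
>>     }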
>> 
>> 
>> Enrico
>> 
>> 
>> -Flavio 
>> 
>> > On 8 Jan 2021, at 22:47, Matteo Merli <mme...@apache.org> wrote:
>> > 
>> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>> >> 
>> >> Hi Matteo,
>> >> in this comment you are talking about an issue you saw when WQ is greater 
>> >> than AQ:
>> >> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246 
>> >> 
>> >> IIUC you are saying that if one bookie is slow, the client continues to 
>> >> accumulate references to the entries that have not yet been confirmed 
>> >> by it.
>> >> I think that this is correct.
>> >> 
>> >> Have you seen problems in production related to this scenario ?
>> >> Can you tell more about them ?
>> > 
>> > Yes, for simplicity, assume e=3, w=3, a=2.
>> > 
>> > If one bookie is slow (not down, just slow), the BK client will ack to
>> > the user that the entries are written after the first 2 bookie acks.
>> > In the meantime, it will keep waiting for the 3rd bookie to respond.
>> > If that bookie responds within the timeout, the entries can then be
>> > dropped from memory; otherwise the write will time out internally and
>> > will be replayed to a new bookie.
>> > 
>> > In both cases, the amount of memory used in the client will max out at
>> > "throughput" * "timeout". This can be a large amount of memory and can
>> > easily cause OOM errors.
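>> > (As an illustration only: at a sustained 100 MB/s with a 30 s add timeout,
>> > that bound is roughly 100 MB/s * 30 s = ~3 GB of retained payloads per
>> > client; the figures here are made up.)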
>> > 
>> > Part of the problem is that it cannot be solved from outside the BK
>> > client, since there's no visibility into which entries have 2 or 3 acks,
>> > and therefore it's not possible to apply backpressure. Instead,
>> > there should be a backpressure mechanism in the BK client itself to
>> > prevent this kind of issue.
>> > One possibility could be to use the same approach as described in
>> > https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
>> > giving a max memory limit per BK client instance and throttling
>> > everything after the quota is reached.
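>> > 
>> > Roughly, the shape of such a limiter could be (made-up names, similar in
>> > spirit to Pulsar's MemoryLimitController, just to illustrate the idea):
>> > 
>> >     import java.util.concurrent.locks.Condition;
>> >     import java.util.concurrent.locks.ReentrantLock;
>> > 
>> >     // Sketch: block writers once the client-wide quota of in-flight
>> >     // payload bytes is reached, and release as writes fully complete
>> >     // (i.e. once all WQ copies are written or the op is retired).
>> >     class ClientMemoryLimiter {
>> >         private final long maxBytes;
>> >         private long usedBytes = 0;
>> >         private final ReentrantLock lock = new ReentrantLock();
>> >         private final Condition released = lock.newCondition();
>> > 
>> >         ClientMemoryLimiter(long maxBytes) { this.maxBytes = maxBytes; }
>> > 
>> >         void reserve(long bytes) throws InterruptedException {
>> >             lock.lock();
>> >             try {
>> >                 while (usedBytes + bytes > maxBytes) {
>> >                     released.await(); // throttle the caller
>> >                 }
>> >                 usedBytes += bytes;
>> >             } finally {
>> >                 lock.unlock();
>> >             }
>> >         }
>> > 
>> >         void release(long bytes) {
>> >             lock.lock();
>> >             try {
>> >                 usedBytes -= bytes;
>> >                 released.signalAll(); // wake up throttled writers
>> >             } finally {
>> >                 lock.unlock();
>> >             }
>> >         }
>> >     }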
>> > 
>> > 
>> > Matteo
