Re: Single task backpressure problem with Credit-based Flow Control

Piotr Nowojski Wed, 27 May 2020 06:37:25 -0700

Hi Weihua,

Good to hear that you have found the problem. Let us know if you run into any 
other problems.


Piotrek

> On 27 May 2020, at 14:18, Weihua Hu <huweihua....@gmail.com> wrote:
> 
> Hi Piotrek,
> 
> Thanks for your suggestions. I found some network issues which seem to be 
> the cause of the backpressure.
> 
> Best
> Weihua Hu
> 
>> On 26 May 2020, at 02:54, Piotr Nowojski <pi...@ververica.com> wrote:
>> 
>> Hi Weihua,
>> 
>> > After dumping the memory and analyzing it, I found:
>> > Sink (121)'s RemoteInputChannel.unannouncedCredit = 0,
>> > Map (242)'s CreditBasedSequenceNumberingViewReader.numCreditsAvailable = 0.
>> > This is not consistent with my understanding of the Flink network 
>> > transmission mechanism.
>> 
>> It probably is consistent. The downstream receiver has already announced all 
>> of its credits and is simply waiting for the data to arrive, while the 
>> upstream sender is waiting for the data to be sent down the stream.
>> 
>> The stack trace you posted confirms that the sink has an empty input 
>> buffer: it’s waiting for input data. Assuming rescale partitioning works as 
>> expected and node 242 is indeed connected to node 121, this implies the 
>> bottleneck is the data exchange between those two tasks. It could be:
>> 
>> - network bottleneck (slow network? packet losses?)
>> - machine swapping/long GC pauses (if the upstream node is experiencing long 
>> pauses, it might show up like this)
>> - CPU bottleneck in the network stack (frequent flushing? SSL?)
>> - some resource competition (too high parallelism for the given number of 
>> machines)
>> - Netty threads not keeping up
>> 
>> It’s hard to say what the problem is without looking at the resource usage 
>> (CPU/Network/Memory/Disk IO), GC logs, and code profiling results.
>> 
>> Piotrek
>> 
>> PS Zhijiang:
>> 
>> In this case, RescalePartitioner should connect just two upstream subtasks 
>> to one downstream sink. Upstream subtasks N and N+1 (for even N) should be 
>> connected to the sink with id N/2.
>> 
>>> On 25 May 2020, at 04:39, Weihua Hu <huweihua....@gmail.com> wrote:
>>> 
>>> Hi, Zhijiang
>>> 
>>> I understand the normal credit-based backpressure mechanism. Usually the 
>>> Sink's inPoolUsage will be full, and the task stack will also show some 
>>> relevant information. 
>>> But this time it is different: the Sink's inPoolUsage is 0. 
>>> I also checked the stacks. The Map is waiting in 
>>> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment
>>> while the Sink is waiting for data to process, which is not what I would 
>>> expect.
>>> 
>>> 
>>> <Pasted Graphic-2.tiff>
>>> 
>>> <Pasted Graphic-1.tiff>
>>> 
>>> 
>>> 
>>> Best
>>> Weihua Hu
>>> 
>>>> On 24 May 2020, at 21:57, Zhijiang <wangzhijiang...@aliyun.com> wrote:
>>>> 
>>>> Hi Weihua,
>>>> 
>>>> From the info below, this matches what is expected with credit-based flow 
>>>> control. 
>>>> 
>>>> I guess one of the Sink subtasks is causing the backpressure, so you will 
>>>> see that there are no available credits on the Sink side and
>>>> the outPoolUsage of Map is almost 100%. This really reflects the 
>>>> credit-based state in the case of backpressure.
>>>> 
>>>> If you want to analyze the root cause of the backpressure, you can trace the 
>>>> task stack of the respective Sink subtask to find which operation is 
>>>> expensive,
>>>> then try increasing the parallelism or improving the UDF (if it is the 
>>>> bottleneck). In addition, I am not sure why you chose 
>>>> rescale to shuffle data among operators. The default
>>>> forward mode can give really good performance if you set 
>>>> the same parallelism for them.
>>>> 
>>>> Best,
>>>> Zhijiang
>>>> ------------------------------------------------------------------
>>>> From: Weihua Hu <huweihua....@gmail.com>
>>>> Send Time: 24 May 2020 (Sunday) 18:32
>>>> To: user <user@flink.apache.org>
>>>> Subject: Single task backpressure problem with Credit-based Flow Control
>>>> 
>>>> Hi, all
>>>> 
>>>> I ran into a weird single-task backpressure problem.
>>>> 
>>>> JobInfo:
>>>>     DAG: Source (1000)-> Map (2000)-> Sink (1000), which is linked via 
>>>> rescale. 
>>>>     Flink version: 1.9.0
>>>>     
>>>> There is no related info in the jobmanager/taskmanager logs.
>>>> 
>>>> Through metrics, I see that Map (242)'s outPoolUsage is full, but its 
>>>> downstream Sink (121)'s inPoolUsage is 0.
>>>> 
>>>> After dumping the memory and analyzing it, I found:
>>>> Sink (121)'s RemoteInputChannel.unannouncedCredit = 0,
>>>> Map (242)'s CreditBasedSequenceNumberingViewReader.numCreditsAvailable = 0.
>>>> This is not consistent with my understanding of the Flink network 
>>>> transmission mechanism.
>>>> 
>>>> Can someone help me? Thanks a lot.
>>>> 
>>>> 
>>>> Best
>>>> Weihua Hu
>>>> 
>>>> 
>>> 
>> 
>
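
For readers following the thread: the wiring Piotrek describes in his PS (upstream subtasks N and N+1 feeding the sink with id N/2) can be checked with a few lines of Java. This is only a sketch under stated assumptions: the class and method names are made up for illustration, subtask indices are assumed to be 0-based, the upstream parallelism is assumed to be an exact multiple of the sink parallelism, and none of this is Flink's actual RescalePartitioner code.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative only: mirrors the rescale wiring described in the thread.
 * All names here are hypothetical; this is not Flink's RescalePartitioner.
 */
final class RescaleWiringSketch {

    /** Number of upstream subtasks per sink subtask (2000 / 1000 = 2 in the thread). */
    static int fanIn(int upstreamParallelism, int sinkParallelism) {
        if (sinkParallelism <= 0 || upstreamParallelism % sinkParallelism != 0) {
            throw new IllegalArgumentException(
                "sketch assumes upstream parallelism is a multiple of sink parallelism");
        }
        return upstreamParallelism / sinkParallelism;
    }

    /** Sink subtask index (assumed 0-based) that a given upstream subtask sends to. */
    static int sinkIndexFor(int upstreamIndex, int upstreamParallelism, int sinkParallelism) {
        if (upstreamIndex < 0 || upstreamIndex >= upstreamParallelism) {
            throw new IllegalArgumentException("upstream index out of range");
        }
        return upstreamIndex / fanIn(upstreamParallelism, sinkParallelism);
    }

    /** Upstream subtask indices that feed a given sink subtask. */
    static List<Integer> upstreamIndicesFor(int sinkIndex, int upstreamParallelism, int sinkParallelism) {
        if (sinkIndex < 0 || sinkIndex >= sinkParallelism) {
            throw new IllegalArgumentException("sink index out of range");
        }
        int n = fanIn(upstreamParallelism, sinkParallelism);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            result.add(sinkIndex * n + i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Job from the thread: Map (2000) -> Sink (1000).
        System.out.println(sinkIndexFor(242, 2000, 1000));       // prints 121
        System.out.println(upstreamIndicesFor(121, 2000, 1000)); // prints [242, 243]
    }
}
```

Under these assumptions, Map subtasks 242 and 243 both feed Sink subtask 121, which matches the pair of tasks examined in the thread. If the metric labels are 1-based rather than 0-based, the pairing shifts by one, so it is worth confirming the indexing before drawing conclusions.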
