Hi Piotrek,

Thanks for your suggestions. I found some network issues, which seem to be the 
cause of the back pressure.

Best
Weihua Hu

> On 26 May 2020, at 02:54, Piotr Nowojski <pi...@ververica.com> wrote:
> 
> Hi Weihua,
> 
> > After dumping the memory and analyzing it, I found:
> > Sink (121)'s RemoteInputChannel.unannouncedCredit = 0,
> > Map (242)'s CreditBasedSequenceNumberingViewReader.numCreditsAvailable = 0.
> > This is not consistent with my understanding of the Flink network 
> > transmission mechanism.
> 
> It probably is consistent. The downstream receiver has announced all of its 
> credits and is simply waiting for the data to arrive, while the upstream 
> sender is waiting for its data to be sent down the stream.
> 
> The stack trace you posted confirms that the sink has an empty input 
> buffer - it’s waiting for input data. Assuming rescale partitioning works as 
> expected and node 242 is indeed connected to node 121, it implies the 
> bottleneck is the data exchange between those two tasks. It could be:
> 
> - network bottleneck (slow network? Packet losses?)
> - machine swapping/long GC pauses (If upstream node is experiencing long 
> pauses it might show up like this)
> - CPU bottleneck in the network stack (frequent flushing? SSL?)
> - some resource competition (too high parallelism for given number of 
> machines)
> - netty threads are not keeping up
> 
> It’s hard to say what the problem is without looking at the resource usage 
> (CPU/Network/Memory/Disk IO), GC logs, and code profiling results.
> 
> Piotrek
> 
> PS Zhijiang:
> 
> RescalePartitioner in this case should connect just two upstream subtasks 
> to one downstream sink. Upstream subtasks N and N+1 (for even N) should be 
> connected to the sink with id N/2.
> 
>> On 25 May 2020, at 04:39, Weihua Hu <huweihua....@gmail.com 
>> <mailto:huweihua....@gmail.com>> wrote:
>> 
>> Hi Zhijiang,
>> 
>> I understand the normal credit-based backpressure mechanism. Usually the 
>> Sink's inPoolUsage would be full, and the task stack would also show some 
>> related information. 
>> This time it is different: the Sink's inPoolUsage is 0. 
>> I also checked the stacks. The Map is waiting in 
>> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment
>> and the Sink is waiting for data to process, which is not what I expected.
>> 
>> 
>> <粘贴的图形-2.tiff>
>> 
>> <粘贴的图形-1.tiff>
>> 
>> 
>> 
>> Best
>> Weihua Hu
>> 
>>> On 24 May 2020, at 21:57, Zhijiang <wangzhijiang...@aliyun.com 
>>> <mailto:wangzhijiang...@aliyun.com>> wrote:
>>> 
>>> Hi Weihua,
>>> 
>>> From the info below, this matches what is expected with credit-based flow 
>>> control. 
>>> 
>>> I guess one of the sink subtasks causes the backpressure, so you see 
>>> that there are no available credits on the Sink side and
>>> the outPoolUsage of Map is almost 100%. This really reflects the credit-based 
>>> state in the case of backpressure.
>>> 
>>> If you want to analyze the root cause of the backpressure, you can trace the 
>>> task stack of the respective Sink subtask to find which operation is 
>>> expensive,
>>> then try increasing the parallelism or improving the UDF (if it is the 
>>> bottleneck). In addition, I am not sure why you chose 
>>> rescale to shuffle data among operators. The default
>>> forward mode can give really good performance if you set 
>>> the same parallelism for them.
>>> 
>>> Best,
>>> Zhijiang
>>> ------------------------------------------------------------------
>>> From:Weihua Hu <huweihua....@gmail.com <mailto:huweihua....@gmail.com>>
>>> Send Time: 24 May 2020 (Sunday) 18:32
>>> To:user <user@flink.apache.org <mailto:user@flink.apache.org>>
>>> Subject:Singal task backpressure problem with Credit-based Flow Control
>>> 
>>> Hi, all
>>> 
>>> I ran into a weird single Task BackPressure problem.
>>> 
>>> JobInfo:
>>>     DAG: Source (1000) -> Map (2000) -> Sink (1000), linked via rescale.
>>>     Flink version: 1.9.0
>>>     
>>> There is no related info in the jobmanager/taskmanager logs.
>>> 
>>> Through the metrics, I see that Map (242)'s outPoolUsage is full, but its 
>>> downstream Sink (121)'s inPoolUsage is 0.
>>> 
>>> After dumping the memory and analyzing it, I found:
>>> Sink (121)'s RemoteInputChannel.unannouncedCredit = 0,
>>> Map (242)'s CreditBasedSequenceNumberingViewReader.numCreditsAvailable = 0.
>>> This is not consistent with my understanding of the Flink network 
>>> transmission mechanism.
>>> 
>>> Can someone help me? Thanks a lot.
>>> 
>>> 
>>> Best
>>> Weihua Hu
>>> 
>>> 
>> 
> 
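
As a rough illustration of the mapping described in the PS above (upstream
subtasks N and N+1 feeding the sink with id N/2), here is a minimal Python
sketch. It does not call any Flink API. The function name rescale_target and
its parameters are made up for illustration, and it assumes 0-based subtask
indices and an upstream parallelism that is an exact multiple of the
downstream parallelism.

```python
def rescale_target(upstream_index: int,
                   upstream_parallelism: int,
                   downstream_parallelism: int) -> int:
    """Return the downstream subtask fed by the given upstream subtask.

    Hypothetical helper, not Flink code. It models the simple case where
    each downstream subtask reads from a fixed-size group of consecutive
    upstream subtasks.
    """
    if not 0 <= upstream_index < upstream_parallelism:
        raise ValueError("upstream_index out of range")
    if upstream_parallelism % downstream_parallelism != 0:
        raise ValueError("sketch only covers exact multiples")

    # Number of upstream subtasks that share one downstream subtask.
    fan_in = upstream_parallelism // downstream_parallelism

    # Consecutive groups of fan_in upstream subtasks share one target.
    return upstream_index // fan_in
```

With Map (2000) -> Sink (1000), rescale_target(242, 2000, 1000) and
rescale_target(243, 2000, 1000) both give 121, which matches the
Map (242) / Sink (121) pair discussed in this thread.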

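The state reported in this thread (no unannounced credit on the receiver, no
available credit on the sender, and an empty input queue on the Sink) can be
reproduced with a toy model of credit-based flow control in which the network
is the slow part. This is only a sketch under simplifying assumptions: the
class Link, its fields, and the function snapshot_with_slow_network are
hypothetical names that loosely mirror unannouncedCredit and
numCreditsAvailable, and nothing here is Flink code.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Toy model of one sender -> receiver channel (not Flink code)."""
    sender_backlog: int            # buffers the sender still wants to ship
    sender_credits: int = 0        # credits the sender may spend
    data_in_flight: int = 0        # buffers sent but not yet delivered
    credits_in_flight: int = 0     # credits announced but not yet delivered
    receiver_queued: int = 0       # buffers delivered, not yet processed
    receiver_unannounced: int = 0  # free buffers not yet announced

    def announce(self) -> None:
        """Receiver announces all of its free buffers as credit."""
        self.credits_in_flight += self.receiver_unannounced
        self.receiver_unannounced = 0

    def deliver_credits(self) -> None:
        """Network delivers pending credit announcements to the sender."""
        self.sender_credits += self.credits_in_flight
        self.credits_in_flight = 0

    def send(self) -> None:
        """Sender ships as many buffers as it has credit for."""
        n = min(self.sender_credits, self.sender_backlog)
        self.sender_credits -= n
        self.sender_backlog -= n
        self.data_in_flight += n

    def deliver_data(self, limit: int) -> None:
        """Network delivers at most `limit` buffers to the receiver."""
        n = min(limit, self.data_in_flight)
        self.data_in_flight -= n
        self.receiver_queued += n

    def consume(self) -> None:
        """Receiver processes queued buffers, freeing them again."""
        self.receiver_unannounced += self.receiver_queued
        self.receiver_queued = 0


def snapshot_with_slow_network(exclusive_buffers: int = 2,
                               backlog: int = 10) -> Link:
    """State after credits were used but the data has not arrived yet."""
    link = Link(sender_backlog=backlog,
                receiver_unannounced=exclusive_buffers)
    link.announce()
    link.deliver_credits()
    link.send()
    # deliver_data() is deliberately not called: the network is stalled.
    return link
```

In the snapshot, the sender has spent all of its credit, the receiver has
nothing left to announce, and the receiver's queue is empty because every
buffer is still in flight. Both sides are idle and waiting on the network,
which is one way the observed metrics can be consistent with credit-based
flow control.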