[jira] [Commented] (FLINK-10319) Too many requestPartitionState would crash JM

ASF GitHub Bot (JIRA) Sun, 07 Oct 2018 20:55:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641328#comment-16641328
 ]


ASF GitHub Bot commented on FLINK-10319:
----------------------------------------

TisonKun commented on issue #6680: [FLINK-10319] [runtime] Too many 
requestPartitionState would crash JM
URL: https://github.com/apache/flink/pull/6680#issuecomment-427716261
 
 
   @tillrohrmann it would be more accurate to say that the `JobMaster` is 
overwhelmed by too many RPC requests.
   
   This issue was filed during a benchmark of job scheduling performance 
with a 2000x2000 ALL-to-ALL streaming (EAGER) job. The input data is empty, so 
the tasks finish soon after they start.
   
   In this case the JM responds slowly to RPCs, and the TM/RM heartbeats to 
the JM eventually time out. The cause is that ~2,000,000 
`requestPartitionState` messages are triggered by 
`triggerPartitionProducerStateCheck` within a short time, which overwhelms the 
JM RPC main thread. This happens because, with EAGER scheduling, downstream 
tasks can start earlier than their upstream tasks.
   
   For your second question, the task can simply wait for a while and retry 
if the partition does not exist. There are two cases in which the partition 
does not exist: 1. the partition has not been started yet; 2. the partition 
has failed. In case 1, retrying works. In case 2, a task failover will soon 
happen and cancel the downstream tasks as well.
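
The "wait and retry locally" behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Flink's actual code: the class `PartitionRequestRetry`, its methods, and the `requestPartition` callback are hypothetical names, and the doubling backoff is one possible policy. The point is that a missing partition is handled by the consuming task itself, with no RPC to the JM.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of task-local retry; none of these names come from Flink. */
final class PartitionRequestRetry {

    /** Backoff before retry number `attempt` (0-based): doubles each time, capped at maxBackoffMs. */
    static long backoffMs(int attempt, long initialBackoffMs, long maxBackoffMs) {
        long backoff = initialBackoffMs;
        for (int i = 0; i < attempt && backoff < maxBackoffMs; i++) {
            backoff *= 2;
        }
        return Math.min(backoff, maxBackoffMs);
    }

    /**
     * Calls requestPartition until it returns true or maxAttempts is reached.
     * Case 1 (producer not started yet): a later attempt succeeds.
     * Case 2 (producer failed): attempts run out, or the task is cancelled
     * by failover first (modelled here as a thread interrupt).
     */
    static boolean requestWithRetry(
            BooleanSupplier requestPartition,
            int maxAttempts,
            long initialBackoffMs,
            long maxBackoffMs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (requestPartition.getAsBoolean()) {
                return true;
            }
            if (attempt + 1 < maxAttempts) {
                try {
                    Thread.sleep(backoffMs(attempt, initialBackoffMs, maxBackoffMs));
                } catch (InterruptedException e) {
                    // Treat interruption as cancellation and stop retrying.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

With a cap on the backoff, a downstream task that starts well before its producer costs only a bounded number of local retries and no messages on the JM main thread.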

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Too many requestPartitionState would crash JM
> ---------------------------------------------
>
>                 Key: FLINK-10319
>                 URL: https://issues.apache.org/jira/browse/FLINK-10319
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.7.0
>            Reporter: tison
>            Assignee: tison
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>
> Do not requestPartitionState from the JM when a partition request fails, as 
> this may generate too many RPC requests and block the JM.
> We gain little from checking what state the producer is in, while on the 
> other hand doing so can crash the JM with too many RPC requests. A task can 
> always retriggerPartitionRequest from its InputGate; this fails if the 
> producer is gone and succeeds if the producer is alive. Either way, there is 
> no need to ask the JM for help.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
