Hi Alexey & Smile,

JM & RM are located in the same process, so a network issue is unlikely. Such timeouts are usually caused by one of the two endpoints not responding in time.

Some common causes:
- The process is under severe GC pressure. You can check the GC logs to confirm this.
- Insufficient CPU resources. You may check the CPU workload of the physical machine (standalone) or pod/container (K8s/Yarn).
- Busy RPC main thread. Even if there are sufficient CPU resources (multiple cores), the processing capacity can be limited by the single RPC main thread. This is usually observed for large-scale jobs (in terms of number of vertices and parallelism). In that case, we would have to increase the heartbeat timeout.

Thank you~

Xintong Song

On Mon, May 17, 2021 at 11:12 AM Smile <letters_sm...@163.com> wrote:

> JM log shows this:
>
> INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of JobManager with id 41e3ef1f248d24ddefdccd1887947106 timed out.
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
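
P.S. For the busy RPC main thread case, the heartbeat timeout is raised in flink-conf.yaml. A minimal sketch follows. The 180000 ms value is only an illustrative assumption and should be tuned to your job; the commented defaults are what I recall for recent releases, so please check them against the docs for your Flink version.

```yaml
# flink-conf.yaml (sketch; values are illustrative assumptions, not recommendations)

# How long to wait for a heartbeat before the peer is considered timed out, in ms.
# Default is 50000 in recent releases; 180000 here is only an example.
heartbeat.timeout: 180000

# How often heartbeats are requested, in ms (default 10000).
# Keep this well below heartbeat.timeout.
heartbeat.interval: 10000
```

Note that a larger timeout also means real failures are detected later, so it is better to rule out GC pressure and CPU starvation first.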