Hi Alexey & Smile,

JM & RM run in the same process, so a network issue is unlikely. Such
timeouts are usually caused by one of the two endpoints not responding in
time.

Some common causes:
- The process is under severe GC pressure. You can check the GC logs for
long or frequent pauses.
- Insufficient CPU resources. You may check the CPU load of the physical
machine (standalone) or pod/container (K8s/Yarn).
- Busy RPC main thread. Even if there are sufficient CPU resources (multiple
cores), the processing capacity can be limited by the single RPC main
thread. This is usually observed for large-scale jobs (in terms of
number of vertices and parallelism). In that case, we would have to
increase the heartbeat timeout.
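
For the last case, a rough sketch of what the change could look like in
flink-conf.yaml. The value 120000 is only an illustration and not a
recommendation; as far as I know the defaults are 50000 ms for
heartbeat.timeout and 10000 ms for heartbeat.interval, so please pick a
value that fits your job and verify it against the docs of your Flink
version.

```yaml
# Illustrative values only (assumption): raise the timeout so that a busy
# RPC main thread has more time to answer heartbeats.
heartbeat.timeout: 120000   # ms, default 50000
heartbeat.interval: 10000   # ms, default 10000, kept well below the timeout
```

Note that a larger timeout also means it takes longer to detect a process
that is really lost, so it is a trade-off rather than a fix for the
underlying load.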

Thank you~

Xintong Song



On Mon, May 17, 2021 at 11:12 AM Smile <letters_sm...@163.com> wrote:

> JM log shows this:
>
> INFO  org.apache.flink.yarn.YarnResourceManager                     - The
> heartbeat of JobManager with id 41e3ef1f248d24ddefdccd1887947106 timed out.
>
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>
