Seems something bad happened in the task managers and led to heartbeat timeouts. These TMs were not released by flink but lost connections with the master node. I think you need to check the TM log to see what happens there.
Thanks, Zhu Zhu seeksst <seek...@163.com> 于2020年4月26日周日 下午2:13写道: > Thank you for your reply. > > > I forget providing some information. I use 'run -m yarn-cluster’ to start > my job, which means ‘run a single flink job on yarn’. after one minute, the > job throw exception: java.util.concurrent.TimeoutException: The heartbeat > of TaskManager with id container_1581388570291_0133_01_000003 timed out. > > > First, the job start with two taskManager: > org.apache.flink.yarn.YarnResourceManager - Registering TaskManager with > ResourceID container_1581388570291_0133_01_000003 (xxx.xxx.xxx.xxx:38211) > org.apache.flink.yarn.YarnResourceManager - Registering TaskManager with > ResourceID container_1581388570291_0133_01_000002 (xxx.xxx.xxx.xxx:33715) > > Then, 003 timeout, and throw with exception: > org.apache.flink.yarn.YarnResourceManager - The heartbeat of TaskManager > with id container_1581388570291_0133_01_000003 timed out. > org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor > connection container_1581388570291_0133_01_000003 because: The heartbeat of > TaskManager with id container_1581388570291_0133_01_000003 timed out. > > > Switch RUNNING TO CANCELING, Swith CANCELING To CANCELED. > > After 10 Seconds(I used fixedDelayRestart), Switch Restarting TO RUNNING. > > switched from CREATED to SCHEDULED. > > Requesting new slot [SlotRequestId{c6c137acf7ef9fd639157f0e9495fe42}] and > profile ResourceProfile{UNKNOWN} from resource manager. > Requesting new TaskExecutor container with resources <memory:12288, > vCores:3>. Number pending requests 1. > Requesting new TaskExecutor container with resources <memory:12288, > vCores:3>. Number pending requests 2. > > The heartbeat of TaskManager with id > container_1581388570291_0133_01_000002 timed out. > Closing TaskExecutor connection container_1581388570291_0133_01_000002 > because: The heartbeat of TaskManager with id > container_1581388570291_0133_01_000002 timed out. > > > org.apache.flink.yarn.YarnResourceManager - Received 1 containers with 2 > pending container requests. > org.apache.flink.yarn.YarnResourceManager - Removing container request > Capability[<memory:12288, vCores:3>]Priority[1]. Pending container requests > 1. > org.apache.flink.yarn.YarnResourceManager - TaskExecutor > container_1581388570291_0133_01_000004 will be started > org.apache.flink.yarn.YarnResourceManager - Registering TaskManager with > ResourceID container_1581388570291_0133_01_000004 (xxx.xxx.xxx.xxx:40463) > akka.remote.transport.netty.NettyTransport - Remote connection to [null] > failed with java.net.ConnectException: Connection refused:xxx.xxx.xxx.xxx: > 33715 > > > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > Could not allocate the required slot within slot request timeout. Please > make sure that the cluster has enough resources. > > And it restart again. switched from SCHEDULED to CANCELING. switched from > CANCELING to CANCELED. > 10 Seconds later, switched from CREATED to SCHEDULED. > akka.remote.transport.netty.NettyTransport - Remote connection to [null] > failed with java.net.ConnectException: Connection refused: > (xxx.xxx.xxx.xxx:33715) > > the port 33715 is container_1581388570291_0133_01_000002, it was closed > already. > then > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > Could not allocate the required slot within slot request timeout. Please > make sure that the cluster has enough resources. > > > 10 seconds later, the third times restart, it only report > NoResourceAvailableException, and nothing about 33715. > > Now, the job only have one task manager 004, but yarn resource has nothing > left. last email has no task manager and no resource. > > I don’t know what make this happen. there is enough resource if all old > taskmanager was released, > sometimes the job can create one, sometimes none. this never happen on > 1.8.2, i use same cluster and job, just different flink version. > the job may fail and auto-recovery. but in 1.10.0, it seems yarn miss some > taskmanager fail, and not release resource, so the new one > can’t be created. > > What’s more should i do? > Thanks a lot. > 原始邮件 > *发件人:* Zhu Zhu<reed...@gmail.com> > *收件人:* seeksst<seek...@163.com> > *抄送:* user<user@flink.apache.org> > *发送时间:* 2020年4月26日(周日) 11:52 > *主题:* Re: Flink 1.10.0 failover > > Sorry I did not quite understand the problem. > Do you mean a failed job does not release resources to yarn? > - if so, is the job in restarting process? A job in recovery will reuse > the slots so they will not be release. > Or a failed job cannot acquire slots when it is restarted in auto-recovery? > - if so, normally the job should be in a loop like (restarting tasks -> > allocating slots -> failed due to not be able to acquire enough slots -> > restarting task -> ...). Would you check whether the job is in such a loop? > Or the job cannot allocate enough slots even if the cluster has enough > resource? > > Thanks, > Zhu Zhu > > > > seeksst <seek...@163.com> 于2020年4月26日周日 上午11:21写道: > >> Hi, >> >> >> Recently, I find a problem when job failed in 1.10.0, flink didn’t >> release resource first. >> >> >> >> You can see I used flink on yarn, and it doesn’t allocate task >> manager, beacause no more memory left. >> >> If i cancel the job, the cluster has more memory. >> >> In 1.8.2, the job will restart normally, is this a bug? >> >> Thanks. >> >