When I turn off ZooKeeper HA and rerun the failure test, YARN's ResourceManager log shows the following message.
2022-07-18 23:42:53,633 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1658129937018_0030 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1658129937018_0030_000001 exited with exitCode: -104 Failing this attempt.Diagnostics: Container [pid=760348,containerID=container_e01_1658129937018_0030_01_000001] is running beyond physical memory limits. Current usage: 80.1 GB of 80 GB physical memory used; 86.1 GB of 640 GB virtual memory used. Killing container.

My JM memory is set to 80G, so it is hard to imagine an OOM for a component like the JM that does not run any business logic (job parallelism is 3000, with multiple agg operations and sinks).

---- Replied Message ----
| From | Geng Biao<biaoge...@gmail.com> |
| Date | 07/18/2022 23:31 |
| To | SmileSmile<a511955...@163.com> |
| Cc | user<user@flink.apache.org> |
| Subject | Re: flink on yarn job always restart |

The log shows "Diagnostics Cluster entrypoint has been closed externally..", so are you trying to kill the YARN cluster entrypoint process directly in the terminal using "kill <pid>"? If users want to kill a TM, they should go to the machine where the TM process resides and kill the TM process there. The cluster entrypoint is the driver that launches the Flink cluster on YARN; it is not the JM or TM process. The ZK HA is for the JM (i.e. starting a new JM when the previous JM fails), and the TM is managed by the JM, which, IIUC, does not directly interact with ZK. It is possible that the JM will be restarted repeatedly (check details in this doc) due to wrong configuration, but that may not be your case here.

Best,
Biao Geng

From: SmileSmile <a511955...@163.com>
Date: Monday, July 18, 2022 at 11:08 PM
To: biaogeng7 <biaoge...@gmail.com>
Cc: user <user@flink.apache.org>
Subject: Re: flink on yarn job always restart

Thanks for the reply. Our scenario was a failure test to see whether the job would recover on its own after killing a TM. It turns out that the job receives a SIGNAL 15 and hangs during the switch from DEPLOYING to INITIALIZING; with ZK HA it then appears to restart repeatedly.

My confusion:
1. Why does it receive SIGNAL 15?
2. Is it caused by some configuration (e.g. a deploy timeout triggering the kill)?

---- Replied Message ----
| From | Geng Biao<biaoge...@gmail.com> |
| Date | 07/18/2022 22:36 |
| To | SmileSmile<a511955...@163.com>、user<user@flink.apache.org> |
| Cc | |
| Subject | Re: flink on yarn job always restart |

Hi,
One possible direction is to check your YARN log or TM log to see if the YARN RM kills the TM for some reason (e.g. physical memory is over the limit); as a result, the JM will try to recover the TM repeatedly according to your restart strategy. The snippet of JM logs you provided is usually not the root cause.

Best,
Biao Geng

From: SmileSmile <a511955...@163.com>
Date: Monday, July 18, 2022 at 8:46 PM
To: user <user@flink.apache.org>
Subject: flink on yarn job always restart

hi all,
We have hit a situation: the job has parallelism 3000 and contains multiple agg operations, and whenever it recovers from a checkpoint or savepoint it fails to come back up and restarts repeatedly.

JM error log:
org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

Flink version: 1.14.5

Have any good ideas for troubleshooting?
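For anyone hitting the same exitCode -104 kill, below is a minimal flink-conf.yaml sketch of the knobs discussed in this thread (JM total process memory, JVM overhead, YARN AM attempts, ZK HA and the restart strategy). The values are illustrative assumptions for an 80 GB JM, not the poster's actual configuration, and the ZK quorum hosts are placeholders:

    # Total size of the JobManager process; YARN sizes the AM container from this value,
    # so all native memory (metaspace, JVM overhead) must fit inside it.
    jobmanager.memory.process.size: 80g

    # Larger JVM overhead leaves headroom for native allocations, which can keep the
    # process under the physical-memory limit enforced by the YARN NodeManager.
    jobmanager.memory.jvm-overhead.min: 2g
    jobmanager.memory.jvm-overhead.max: 8g

    # AM restart attempts requested by the application (the "local limit" in the RM log).
    yarn.application-attempts: 1

    # ZooKeeper HA for the JobManager and the job-level restart strategy.
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 3
    restart-strategy.fixed-delay.delay: 10 s

The full container diagnostics for the killed attempt can be pulled afterwards with:
yarn logs -applicationId application_1658129937018_0030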