Hello,

Today we encountered an issue where our Flink job requested YARN containers 
indefinitely. As shown in the JM log below, errors occurred when starting the 
TaskManagers (caused by underlying HDFS errors). The allocated containers 
failed and the job kept requesting new ones. The failed containers were also 
not returned to YARN, so this job quickly exhausted our YARN resources.

Is there any way we can avoid such behavior? Thank you!
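
For context, the only related knob we have found so far is the legacy 
yarn.maximum-failed-containers option, which is supposed to fail the session 
after a bounded number of container failures instead of retrying forever. We 
are not sure whether the new YarnResourceManager still honors it. A minimal 
sketch of what we are considering (the value 10 is arbitrary and only 
illustrative):

    import org.apache.flink.configuration.Configuration;

    public class FailedContainerCap {
        public static void main(String[] args) {
            // Illustrative only: cap how many failed containers are tolerated
            // before the YARN session is failed, instead of re-requesting
            // containers indefinitely. Unclear to us whether the FLIP-6
            // YarnResourceManager respects this legacy setting.
            Configuration flinkConf = new Configuration();
            flinkConf.setInteger("yarn.maximum-failed-containers", 10);
            System.out.println(flinkConf);
        }
    }

If that option no longer applies to the new resource manager, is there a 
recommended alternative?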

————————
JM log:

INFO  org.apache.flink.yarn.YarnResourceManager                     - Creating 
container launch context for TaskManagers
INFO  org.apache.flink.yarn.YarnResourceManager                     - Starting 
TaskManagers
INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  
- Opening proxy : xxx.yyy
ERROR org.apache.flink.yarn.YarnResourceManager                     - Could not 
start TaskManager in container container_e12345.
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start 
container.
....
INFO  org.apache.flink.yarn.YarnResourceManager                     - 
Requesting new TaskExecutor container with resources <memory:16384, vCores:4>. 
Number pending requests 19.
INFO  org.apache.flink.yarn.YarnResourceManager                     - Received 
new container: container_e195_1553781735010_27100_01_000136 - Remaining pending 
container requests: 19
————————

Thanks,
Qi
