Hi Gary,

I faced a similar problem yesterday, but I don't know the cause yet.
What I observed is as follows:
 - At about 2:57, one of my EMR execution nodes (IP ...99) got disconnected
from the YARN resource manager (I could no longer see that node on the RM),
even though the node was still running. (This is a separate issue, but I
believe it lies with YARN.)
 - About 8 hours after that (between 10:00 and 11:00), I turned the
problematic EMR core node off. AWS spun up another node and added it to the
cluster to replace it. The YARN RM soon recognized the new node and added it
to its list of available nodes.
However, the JM did not seem to (be able to) do anything after that. It kept
trying to start the job and failing after the timeout with that "no resource
available" exception, again and again. No jobmanager logs have been recorded
since 2:57:15, though.

I am attaching the logs collected via "yarn logs --applicationId <appId>"
here, but it seems I am still missing something.

I am using Flink 1.7.1, with the yarn-site setting
yarn.resourcemanager.am.max-attempts=5. All Flink configuration options are
left at their default values.

Thanks and best regards,
Averell

flink.log
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/flink.log>
  


