Re: No resource available error while testing HA

2019-03-15 Thread Averell
Hi Gary, Thanks for the answer. I missed your most recent answer in this thread too. However, my last question
Averell wrote
> How about changing the configuration of the Flink job itself during runtime?
> What I have to do now is to take a savepoint, stop the job, change the configuration,

Re: No resource available error while testing HA

2019-02-14 Thread Gary Yao
Hi Averell, The TM containers fetch the Flink binaries and config files from HDFS (or another DFS if configured) [1]. I think you should be able to change the log level by patching the logback configuration in HDFS and then killing all Flink containers on all hosts. If you are running an HA setup, your c

Re: No resource available error while testing HA

2019-02-13 Thread Averell
Hi Gary, Thanks for the suggestion. How about changing the configuration of the Flink job itself during runtime? What I have to do now is to take a savepoint, stop the job, change the configuration, and then restore the job from the savepoint. Is there any easier way to do that? Thanks and r
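[Editor's sketch] The manual cycle described above (savepoint, stop, change configuration, restore) could look roughly like the following with the Flink CLI. This is only an illustration under assumptions: the savepoint directory, job ID, and jar name are hypothetical placeholders that do not come from the thread, and on YARN the CLI may additionally need the application ID (for example via -yid). The helper functions only print the commands so they can be reviewed before running.

```shell
#!/usr/bin/env bash
# Sketch only: prints the two Flink CLI commands that bracket a config change.
# Function names and all argument values are hypothetical placeholders.

# Print the command that cancels a job and writes a savepoint in one step.
#   $1 = target directory for the savepoint, $2 = Flink job ID
cancel_with_savepoint_cmd() {
  printf '%s\n' "flink cancel -s $1 $2"
}

# Print the command that resubmits the job, resuming from a savepoint.
#   $1 = path of the savepoint written by the previous step, $2 = job jar
resume_from_savepoint_cmd() {
  printf '%s\n' "flink run -s $1 $2"
}

# Example values (assumptions, not taken from the thread):
cancel_with_savepoint_cmd "hdfs:///flink/savepoints" "<jobId>"
# ... edit the job configuration here ...
resume_from_savepoint_cmd "hdfs:///flink/savepoints/<savepointDir>" "my-job.jar"
```

The savepoint path passed to the second command is the one reported by the cancel step, so the two commands cannot be fully scripted without capturing that output.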

Re: No resource available error while testing HA

2019-02-11 Thread Gary Yao
Hi Averell, Logback has this feature [1], but it is not enabled out of the box. You will have to enable the JMX agent by setting the com.sun.management.jmxremote system property [2][3]. I have not tried this out, though. Best, Gary [1] https://logback.qos.ch/manual/jmxConfig.html [2] https://docs.or
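[Editor's sketch] One way the suggestion above might be wired up, untested like the original advice: logback's JMX configurator is switched on by adding a <jmxConfigurator /> element to the logback configuration file, and the JVM flags can be passed to the Flink processes through the env.java.opts entry in flink-conf.yaml. The port number and the disabled authentication/SSL below are assumptions suitable only for a trusted test network, and the jmx_opts helper is a hypothetical name. A fixed port would also clash if several TaskManagers share a host.

```shell
#!/usr/bin/env bash
# Sketch only: builds the JVM flags that enable the remote JMX agent.
# jmx_opts is a hypothetical helper; the default port 9999 is an assumption.
jmx_opts() {
  port="${1:-9999}"
  printf '%s\n' "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=${port} -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
}

# Print a line that could be added to flink-conf.yaml.
printf '%s\n' "env.java.opts: $(jmx_opts 9999)"
```

With the agent running, a JMX client such as jconsole could then connect and change logger levels through the logback MBean.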

Re: No resource available error while testing HA

2019-02-07 Thread Averell
Hi Gary, I am trying to reproduce that problem. BTW, is it possible to change the log level (I'm using logback) of a running job? Thanks and regards, Averell

Re: No resource available error while testing HA

2019-02-06 Thread Gary Yao
Hi Averell, That log file does not look complete. I do not see any INFO level log messages such as [1]. Best, Gary [1] https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L544 On Fri, Feb 1, 2019 a

Re: No resource available error while testing HA

2019-01-31 Thread Averell
Hi Gary, I faced a similar problem yesterday, but I don't know yet what the cause was. The situation that I observed is as follows:
- At about 2:57, one of my EMR execution nodes (IP ...99) got disconnected from the YARN resource manager (on the RM I could not see that node anymore), despite that the node wa

Re: No resource available error while testing HA

2019-01-29 Thread Averell
Hi Gary, Thanks for the help.
Gary Yao-3 wrote
> You are writing that it takes YARN 10 minutes to restart the application master (AM). However, in my experiments the AM container is restarted within a few seconds after killing the process. If in your setup YARN actually needs 10 minu

Re: No resource available error while testing HA

2019-01-29 Thread Gary Yao
Hi Averell,
> Is there any way to avoid this? As if I run this as an AWS EMR job, the job would be considered failed, while it is actually restored automatically by YARN after 10 minutes).
You are writing that it takes YARN 10 minutes to restart the application master (AM). However, in my

Re: No resource available error while testing HA

2019-01-25 Thread Averell
Hi Gary, Yes, the problem mentioned in my original post has been resolved by correcting the ZooKeeper connection string. I have two other related questions; if you have time, please help:
1. Regarding JM high availability, when I shut down the host on which the JM is running, YARN would detect that miss

Re: No resource available error while testing HA

2019-01-24 Thread Gary Yao
Hi Averell,
> Then I have another question: when JM cannot start/connect to the JM on .88, why didn't it try on .82 where resources are still available?
When you are deploying on YARN, the TM container placement is decided by the YARN scheduler and not by Flink. Without seeing the complete logs,

Re: No resource available error while testing HA

2019-01-23 Thread Averell
Hi Gary, Thanks for your support. I use Flink 1.7.0. I will try to test without that -n. Below are the JM log (on server .82) and the TM log (on server .88). I'm sorry that I missed that TM log before asking; I had thought that it would not be relevant. I just fixed the issue with connection to zoo

Re: No resource available error while testing HA

2019-01-23 Thread Gary Yao
Hi Averell, What Flink version are you using? Can you attach the full logs from JM and TMs? Since Flink 1.5, the -n parameter (number of taskmanagers) should be omitted unless you are in legacy mode [1].
> As per that screenshot, it looks like there are 2 task managers still running (one on eac