The leader znode is the right one (its contents are binary):

get /flink_test/da_15/leader/00000000000000000000000000000000/job_manager_lock
wFDakka.tcp:// [email protected]:22161/user/jobmanagersrjava.util.UUIDm/J leastSigBitsJ mostSigBitsxpHv

So it does (I think) resolve the right leader of the HA setup. From there, though, the logs do not help: the DEBUG output sadly does not expose which server the client actually hits.
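If it helps, the payload can be decoded outside of Flink. Below is a minimal sketch (plain ZooKeeper client; the quorum host and znode path are copied from the session above) assuming the lock znode is written the way the leader election service appears to write it, i.e. writeUTF of the leader's akka address followed by the serialized session UUID. Not an official API, just a debugging aid:

import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.util.UUID;
import org.apache.zookeeper.ZooKeeper;

public class ReadJobManagerLock {
    public static void main(String[] args) throws Exception {
        // Quorum host and znode path taken from the session above; adjust as needed.
        ZooKeeper zk = new ZooKeeper("zk-f1fb95b9.bf2.tumblr.net:2181", 40000, event -> {});
        try {
            byte[] data = zk.getData(
                    "/flink_test/da_15/leader/00000000000000000000000000000000/job_manager_lock",
                    false, null);
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
                String leaderAddress = ois.readUTF();           // akka.tcp://flink@<host>:<port>/user/jobmanager
                UUID leaderSessionId = (UUID) ois.readObject(); // leader session / fencing id
                System.out.println(leaderAddress + "  " + leaderSessionId);
            }
        } finally {
            zk.close();
        }
    }
}

Decoding it that way at least confirms which host and port the client should be dialing for the JobManager, independent of what the DEBUG logs show.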
On Tue, Jun 26, 2018 at 9:57 AM, Vishal Santoshi <[email protected]> wrote:

> OK few things
>
> 2018-06-26 13:31:29 INFO CliFrontend:282 - Starting Command Line Client (Version: 1.5.0, Rev:c61b108, Date:24.05.2018 @ 14:54:44 UTC)
>
> ...
>
> 2018-06-26 13:31:31 INFO ClientCnxn:876 - Socket connection established to zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, initiating session
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:949 - Session establishment request sent on zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181
>
> 2018-06-26 13:31:31 INFO ClientCnxn:1299 - Session establishment complete on server zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, sessionid = 0x35add547801ea07, negotiated timeout = 40000
>
> 2018-06-26 13:31:31 INFO RestClient:119 - Rest client endpoint started.
>
> 2018-06-26 13:31:31 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 2,3 replyHeader:: 2,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}
>
> 2018-06-26 13:31:31 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 3,3 replyHeader:: 3,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 4,3 replyHeader:: 4,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 5,3 replyHeader:: 5,60416530560,0 request:: '/flink_test/da_15/leader,F response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888}
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/rest_server_lock serverPath:/flink_test/da_15/leader/rest_server_lock finished:false header:: 6,3 replyHeader:: 6,60416530560,-101 request:: '/flink_test/da_15/leader/rest_server_lock,T response::
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 7,3 replyHeader:: 7,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 8,3 replyHeader:: 8,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 9,3 replyHeader:: 9,60416530560,0 request:: '/flink_test/da_15/leader,F response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888}
>
> 2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/dispatcher_lock serverPath:/flink_test/da_15/leader/dispatcher_lock finished:false header:: 10,3 replyHeader:: 10,60416530560,-101 request:: '/flink_test/da_15/leader/dispatcher_lock,T response::
>
> 2018-06-26 13:31:31 INFO CliFrontend:914 - Waiting for response...
>
> Waiting for response...
>
> 2018-06-26 13:31:44 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms
>
> 2018-06-26 13:31:58 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms
>
> 2018-06-26 13:32:01 INFO RestClient:123 - Shutting down rest endpoint.
>
> 2018-06-26 13:32:01 INFO RestClient:140 - Rest endpoint shutdown complete.
>
> 2018-06-26 13:32:01 INFO ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
>
> 2018-06-26 13:32:01 INFO ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>
> 2018-06-26 13:32:01 DEBUG CuratorFrameworkImpl:282 - Closing
>
> 2018-06-26 13:32:01 INFO CuratorFrameworkImpl:821 - backgroundOperationsLoop exiting
>
> 2018-06-26 13:32:01 DEBUG CuratorZookeeperClient:199 - Closing
>
> 2018-06-26 13:32:01 DEBUG ConnectionState:115 - Closing
>
> 2018-06-26 13:32:01 DEBUG ZooKeeper:673 - Closing session: 0x35add547801ea07
>
> 2018-06-26 13:32:01 DEBUG ClientCnxn:1370 - Closing client for session: 0x35add547801ea07
>
> 2018-06-26 13:32:01 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 11,-11 replyHeader:: 11,60416530561,0 request:: null response:: null
>
> 2018-06-26 13:32:01 DEBUG ClientCnxn:1354 - Disconnecting client for session: 0x35add547801ea07
>
> 2018-06-26 13:32:01 INFO ZooKeeper:684 - Session: 0x35add547801ea07 closed
>
> 2018-06-26 13:32:01 INFO ClientCnxn:519 - EventThread shut down for session: 0x35add547801ea07
>
> 2018-06-26 13:32:01 DEBUG ClientCnxn:1146 - An exception was thrown while closing send thread for session 0x35add547801ea07 : Unable to read additional data from server sessionid 0x35add547801ea07, likely server has closed socket
>
> 2018-06-26 13:32:01 ERROR CliFrontend:891 - Error while running the command.
>
> org.apache.flink.util.FlinkException: Failed to retrieve job list.
>
> at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:429)
>
> at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:412)
>
>
> On Tue, Jun 26, 2018 at 5:43 AM, zhangminglei <[email protected]> wrote:
>
>> By the way, in HA set up.
>>
>> On Jun 26, 2018, at 5:39 PM, zhangminglei <[email protected]> wrote:
>>
>> Hi, Gary Yao
>>
>> Once I discovered that there was a change in the ip address [jobmanager.rpc.address]. From 10.208.73.129 to localhost. I think that will cause the issue. What do you think?
>>
>> Cheers
>> Minglei
>>
>> On Jun 26, 2018, at 4:53 PM, Gary Yao <[email protected]> wrote:
>>
>> Hi Vishal,
>>
>> Could it be that you are not using the 1.5.0 client? The stacktrace you posted does not reference valid lines of code in the release-1.5.0-rc6 tag.
>>
>> If you have a HA setup, the host and port of the leading JM will be looked up from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the client must have the same ZooKeeper configuration as used by the Flink cluster.
>>
>> Best,
>> Gary
>>
>> On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[email protected]> wrote:
>>
>>> I think all I need to add is
>>>
>>> web.port: 8081
>>> rest.port: 8081
>>>
>>> to the JM flink conf ?
>>>
>>> On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[email protected]> wrote:
>>>
>>>> Another issue I saw with flink cli...
>>>>
>>>> org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
>>>> at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
>>>> at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
>>>> at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>>>> at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>>>> at org.apach
>>>>
>>>> This was a simple submission and it does succeed through the UI.
>>>>
>>>> Has there been a regression on the CLI... I could not find any documentation around it.
>>>>
>>>> I have a HA JM setup.
>>>>
>>>>
>>>> On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[email protected]> wrote:
>>>>
>>>>> The watermark issue is known and will be fixed in 1.5.1
>>>>>
>>>>>
>>>>> On 25.06.2018 15:03, Vishal Santoshi wrote:
>>>>>
>>>>> Thank you....
>>>>>
>>>>> One addition
>>>>>
>>>>> I do not see WM info on the UI (attached).
>>>>>
>>>>> Is this a known issue? The same pipe on our production has the WM (in fact we never had an issue with watermarks not appearing). Am I missing something?
>>>>>
>>>>> On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[email protected]> wrote:
>>>>>
>>>>>> Hi Vishal,
>>>>>>
>>>>>> 1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
>>>>>> 2. Savepoints should be compatible.
>>>>>> 3. You can keep the slot configuration as before.
>>>>>> 4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).
>>>>>>
>>>>>> Best, Fabian
>>>>>>
>>>>>>
>>>>>> 2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[email protected]>:
>>>>>>
>>>>>>>
>>>>>>> 1.
>>>>>>> Can or has anyone done a rolling upgrade from 1.4 to 1.5? I am not sure we can. It seems that the JM cannot recover jobs, with this exception:
>>>>>>>
>>>>>>> Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2
>>>>>>>
>>>>>>>
>>>>>>> 2.
>>>>>>> Does a SP taken on 1.4 resume on 1.5 (pretty basic, but no harm asking)?
>>>>>>>
>>>>>>>
>>>>>>> 3.
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: what would be the desired setting in a standalone (non mesos/yarn) cluster?
>>>>>>>
>>>>>>>
>>>>>>> 4. I suspend all jobs and establish 1.5 on the JM (the TMs are still running with 1.4). The JM refuses to start with
>>>>>>>
>>>>>>> Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.
>>>>>>> Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey
>>>>>>> Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)
>>>>>>> Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>>>>> Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)
>>>>>>> Jun .....
>>>>>>>
>>>>>>>
>>>>>>> Any feedback would be highly appreciated...
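Coming back to Gary's point above that the flink-conf.yaml the CLI reads must carry the same ZooKeeper settings as the cluster: judging from the znode paths in the DEBUG output, the relevant client-side keys would look roughly like this (the quorum host is taken from the logs and may be only one of several; the storageDir is a placeholder for whatever the cluster actually uses):

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-f1fb95b9.bf2.tumblr.net:2181
high-availability.zookeeper.path.root: /flink_test
high-availability.cluster-id: da_15
high-availability.storageDir: hdfs:///flink/ha/

Those do appear to line up here, since the client resolves /flink_test/da_15/... correctly, so the -101 (NONODE) replies for rest_server_lock and dispatcher_lock rather suggest that nothing has published a leader under those znodes yet.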

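And for the 1.4 -> 1.5 migration Fabian describes (savepoint on the old cluster, resume on the new one), the CLI sequence would roughly be the following — job id, savepoint directory and jar are placeholders, and each invocation has to point at the flink-conf.yaml of the cluster it targets so that the 1.4 and 1.5 processes never mix:

# against the old 1.4 cluster
flink savepoint <jobId> hdfs:///flink/savepoints
# or, to stop the job in the same step
flink cancel -s hdfs:///flink/savepoints <jobId>

# against the new 1.5 cluster
flink run -d -s hdfs:///flink/savepoints/savepoint-XXXX <job-jar> [args]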