[ https://issues.apache.org/jira/browse/FLINK-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951126#comment-14951126 ]
ASF GitHub Bot commented on FLINK-2804: --------------------------------------- GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/1249 [FLINK-2804] [runtime] Add blocking job submission support for HA The JobClientActor is now repsonsible for receiving the JobStatus updates from a newly elected leader. It uses the LeaderRetrievalService to be notified about new leaders. The actor can only be used to submit a single job to the JM. Once it received a job from the Client it tries to send it to the current leader. If no leader is available, a connection timeout is triggered. If the job could be sent to the JM, a submission timeout is triggered if the JobClientActor does not receive a JobSubmitSuccess message within the timeout interval. If the connection to the leader is lost after having submitted a job, a connection timeout is triggered if the JobClientActor cannot reconnect to another JM within the timeout interval. The JobClient simply awaits on the completion of the returned future to the SubmitJobAndWait message. Added test cases for JobClientActor exceptions This PR is based on extended versions of PR #1153 and #1227. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink client-recovery Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1249.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1249 ---- commit 958090b5dfb2b9934605e8f2810e521ad6b783e4 Author: Ufuk Celebi <u...@apache.org> Date: 2015-09-03T13:13:28Z [runtime] Add type parameter to ByteStreamStateHandle commit dc9daef275a75b4418502c760a943d298171b583 Author: Ufuk Celebi <u...@apache.org> Date: 2015-09-19T17:53:18Z [clients, temporary] Submit job in detached mode if recovery enabled commit beb3d7cd65b7f86ef05f4ce08e771dea24903d0b Author: Ufuk Celebi <u...@apache.org> Date: 2015-09-20T11:08:24Z [FLINK-2652] [tests] Temporary ignore flakey PartitionRequestClientFactoryTest commit d26d133728f18052ee1fcca3c6a9486ba00c02a4 Author: Ufuk Celebi <u...@apache.org> Date: 2015-09-30T14:38:37Z [FLINK-2792] [jobmanager, logging] Set actor message log level to TRACE commit 8017dbdb29b1e79fcb90aad079726e75dac9fe4c Author: Ufuk Celebi <u...@apache.org> Date: 2015-09-01T15:25:46Z [FLINK-2354] [runtime] Add job graph and checkpoint recovery commit f376385d9ea6a55b60db288299f4dd15f58e2f3e Author: Till Rohrmann <trohrm...@apache.org> Date: 2015-10-08T22:50:07Z [FLINK-2354] [runtime] Remove state changing futures in JobManager Internal actor states must only be modified within the actor thread. This avoids all the well-known issues coming with concurrency. Fix RemoveCachedJob by introducing RemoveJob Fix JobManagerITCase Add removeJob which maintains the job in the SubmittedJobGraphStore Make revokeLeadership not remove the jobs from the state backend Fix shading problem with curator by hiding CuratorFramework in ChaosMonkeyITCase commit bd4f4d7ef9e74c205e3a3f9a595581e2365584ae Author: Till Rohrmann <trohrm...@apache.org> Date: 2015-10-09T19:48:31Z Fix YARNHighAvailabilityTest by setting the correct state backend commit 20c089c201e1468b5f1f315c37551e1e32ec17f8 Author: Ufuk Celebi <u...@apache.org> Date: 2015-10-05T12:30:46Z [FLINK-2805] [blobmanager] Write JARs to file state backend for recovery Move StateBackend enum to top level and org.apache.flink.runtime.state Abstract blob store in blob server for recovery commit 6d81141d3ff585867c25e73961e7da061007affe Author: Till Rohrmann <trohrm...@apache.org> Date: 2015-10-07T23:52:07Z [FLINK-2804] [runtime] Add blocking job submission support for HA The JobClientActor is now repsonsible for receiving the JobStatus updates from a newly elected leader. It uses the LeaderRetrievalService to be notified about new leaders. The actor can only be used to submit a single job to the JM. Once it received a job from the Client it tries to send it to the current leader. If no leader is available, a connection timeout is triggered. If the job could be sent to the JM, a submission timeout is triggered if the JobClientActor does not receive a JobSubmitSuccess message within the timeout interval. If the connection to the leader is lost after having submitted a job, a connection timeout is triggered if the JobClientActor cannot reconnect to another JM within the timeout interval. The JobClient simply awaits on the completion of the returned future to the SubmitJobAndWait message. Added test cases for JobClientActor exceptions ---- > Support blocking job submission with Job Manager recovery > --------------------------------------------------------- > > Key: FLINK-2804 > URL: https://issues.apache.org/jira/browse/FLINK-2804 > Project: Flink > Issue Type: Improvement > Affects Versions: 0.10 > Reporter: Ufuk Celebi > Assignee: Till Rohrmann > Priority: Minor > > Submitting a job in a blocking fashion with JobManager recovery and a failing > JobManager fails on the client side (the one submitting the job). The job > still continues to be recovered. > I propose to add simple support to re-retrieve the leading job manager and > update the client actor with it and then wait for the result as before. > As of the current standing in PR #1153 > (https://github.com/apache/flink/pull/1153) the job manager assumes that the > same actor is running and just keeps on sending execution state updates etc. > (if the listening behaviour is not detached). -- This message was sent by Atlassian JIRA (v6.3.4#6332)