[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084901#comment-17084901 ]
Stephan Ewen edited comment on FLINK-16770 at 4/16/20, 2:20 PM:
----------------------------------------------------------------
Thank you all for the great discussion and analysis. I would like to add a few points and suggestions, based on the way I understand the problem:

h2. There are two main issues

*(1) Missing ownership in the multi-threaded system. Meaning: who owns the "Pending Checkpoint during Finalization"?*
- It is owned by the CheckpointCoordinator (which aborts it when shutting down).
- It is also owned by the I/O thread or the CompletedCheckpointStore, which writes it to ZooKeeper (or a similar system).

*(2) No shared ground truth between the Checkpoint Coordinator and the JobMaster*
- When a checkpoint is finalized, that decision is not consistently visible to the JM.
- The JM only sees the result once it is in ZK, which is an asynchronous operation.
- That causes the final issue described here: the JM may start from an earlier checkpoint if a restart happens while the asynchronous write to ZK is still in progress.
- NOTE: It is fine to ignore a completed checkpoint if we did not send "notify checkpoint complete" and we are sure it will always be ignored. That is as if the checkpoint never completed.
- NOTE: It is not fine to ignore it and start from an earlier checkpoint if it will get committed later. That is the bug to prevent.

h2. Two steps to a cleaner solution

*(1) When the checkpoint is ready (all tasks acked, metadata written out), the Checkpoint Coordinator transfers ownership to the CompletedCheckpointStore.*
- That means the checkpoint is removed from the "Pending Checkpoints" map and added to the CompletedCheckpointStore in one call in the main thread. If this happens in one call, it is atomic against other modifications (cancellation, disposing checkpoints). Because the checkpoint is removed from the "Pending Checkpoints" map (no longer owned by the coordinator), it will not get cancelled during shutdown of the coordinator.
==> This is a very simple change.

*(2) The addition to the CompletedCheckpointStore must be constant time and executed in the main thread.*
- That means the CompletedCheckpointStore would put the completed checkpoint into a local list and then kick off the asynchronous request to add it to ZK.
- If the JM looks up the latest checkpoint, it refers to that local list. That way all local components refer to the same status and do not exchange status asynchronously via an external system (ZK).

==> The change is that the CompletedCheckpointStore would not always repopulate itself from ZK upon "restore checkpoint", but would keep its local state and only repopulate itself when the master gains leader status (and clear itself when leader status is lost).
==> This is a slightly more complex change, but not too big.

h2. Distributed Races and Corner Cases

I think this is a pre-existing corner case, not related to this bug, but I list it here for completeness. It exists because JM failover can happen concurrently with ZK updates.
* Once the call to add the checkpoint to ZK is sent off, the checkpoint might or might not get added to ZK (which is the distributed ground truth).
* During that time, we cannot restore at all.
** If the JM already restored from the checkpoint, it sends "restore state" to the tasks, which is equivalent to "notify checkpoint complete" and materializes external side effects. If the addition to ZK then fails, and the JM fails and another JM becomes leader, it will restore from an earlier checkpoint.
** If the JM restores from an earlier checkpoint during that time, and the ZK call then completes, we have duplicate side effects.
* In both cases we get fractured consistency or duplicate side effects.

I see three possible solutions, none of which is easy or great:

*(a) We do not restore during the period where the checkpoint is in the "uncertain if committed" state.*
* The CompletedCheckpointStore would need to keep the checkpoint in an "uncertain" list initially, until the I/O executor call returns from adding the checkpoint to ZK.
* When asked for the latest checkpoint, the CompletedCheckpointStore returns a CompletableFuture.
* While the latest checkpoint is in that list, the future cannot be completed. It completes when the ZK command completes (usually a few hundred ms). Restore operations would need to wait during that time.
* There is a separate issue, FLINK-16931, where "loading metadata" for the latest completed checkpoint can take long (seconds), because it is an I/O operation. This sounds like a similar issue, but I fear the solution is more complex than anticipated in that issue.

*(b) Change the contract with operators so that side effects are never committed during restore.*
* Then it is safe to restore from a checkpoint that is not yet in ZK, because the restore never creates side effects.
* The side effects would only be committed after the ZK write is done (notify checkpoint complete).
* During failover, this means that side effects get committed later, because they are not committed during "restore" but only as part of the next completed checkpoint.

*(c) We keep the "add to ZK" command blocking/synchronous for now.*
* This blocks the RPC thread, which is really bad.
* In the worst case it can lead to system crashes, if it blocks for too long.
* We can mitigate this a bit by running the actual I/O in a separate thread and aborting it with a timeout, then double-checking ZK whether the update went through or not.
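The two steps above, combined with the CompletableFuture idea from option (a), could look roughly like the following sketch. All class and method names here are hypothetical illustrations, not Flink's actual CompletedCheckpointStore API, and the ZooKeeper write is stubbed out:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch (not Flink's real API): the store records a completed
// checkpoint in a local list first (constant time, on the main thread) and
// only then kicks off the asynchronous ZooKeeper write. Lookups read the
// local list, so the coordinator and the JobMaster share one ground truth.
// latestForRestore() models option (a): it yields the latest checkpoint id
// only once the ZK write has completed.
class LocalFirstCheckpointStore {
    private final Deque<Long> completed = new ArrayDeque<>(); // newest first
    private CompletableFuture<Void> lastZkWrite = CompletableFuture.completedFuture(null);
    private final ExecutorService ioExecutor;

    LocalFirstCheckpointStore(ExecutorService ioExecutor) {
        this.ioExecutor = ioExecutor;
    }

    /** Called on the main thread only: the local add is atomic w.r.t. other main-thread work. */
    CompletableFuture<Void> addCompleted(long checkpointId) {
        completed.addFirst(checkpointId);                          // 1. local ground truth, O(1)
        lastZkWrite = CompletableFuture.runAsync(
                () -> writeToZooKeeper(checkpointId), ioExecutor); // 2. async external write
        return lastZkWrite;
    }

    /** Option (a): a restore may only proceed once the latest id is confirmed in ZK. */
    CompletableFuture<Long> latestForRestore() {
        final Long latest = completed.peekFirst();
        return lastZkWrite.thenApply(ignored -> latest);
    }

    private void writeToZooKeeper(long checkpointId) {
        // stand-in for the real ZooKeeper client call
    }

    public static void main(String[] args) {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        LocalFirstCheckpointStore store = new LocalFirstCheckpointStore(ex);
        store.addCompleted(1L).join();
        System.out.println("latest = " + store.latestForRestore().join());
        ex.shutdown();
    }
}
```

With this shape, shutting down the coordinator no longer touches the checkpoint (it has left the pending map), and a restore that races with the ZK write waits on the returned future instead of reading stale external state.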
> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end > test fails with no such file > ------------------------------------------------------------------------------------------------------- > > Key: FLINK-16770 > URL: https://issues.apache.org/jira/browse/FLINK-16770 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Tests > Affects Versions: 1.11.0 > Reporter: Zhijiang > Assignee: Yun Tang > Priority: Blocker > Labels: pull-request-available, test-stability > Fix For: 1.11.0 > > Attachments: e2e-output.log, > flink-vsts-standalonesession-0-fv-az53.log, image-2020-04-16-11-24-54-549.png > > Time Spent: 40m > Remaining Estimate: 0h > > The log : > [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6603&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5] > > There was also the similar problem in > https://issues.apache.org/jira/browse/FLINK-16561, but for the case of no > parallelism change. And this case is for scaling up. Not quite sure whether > the root cause is the same one. > {code:java} > 2020-03-25T06:50:31.3894841Z Running 'Resuming Externalized Checkpoint > (rocks, incremental, scale up) end-to-end test' > 2020-03-25T06:50:31.3895308Z > ============================================================================== > 2020-03-25T06:50:31.3907274Z TEST_DATA_DIR: > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304 > 2020-03-25T06:50:31.5500274Z Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT > 2020-03-25T06:50:31.6354639Z Starting cluster. > 2020-03-25T06:50:31.8871932Z Starting standalonesession daemon on host > fv-az655. > 2020-03-25T06:50:33.5021784Z Starting taskexecutor daemon on host fv-az655. > 2020-03-25T06:50:33.5152274Z Waiting for Dispatcher REST endpoint to come > up... > 2020-03-25T06:50:34.5498116Z Waiting for Dispatcher REST endpoint to come > up... 
> 2020-03-25T06:50:35.6031346Z Waiting for Dispatcher REST endpoint to come > up... > 2020-03-25T06:50:36.9848425Z Waiting for Dispatcher REST endpoint to come > up... > 2020-03-25T06:50:38.0283377Z Dispatcher REST endpoint is up. > 2020-03-25T06:50:38.0285490Z Running externalized checkpoints test, with > ORIGINAL_DOP=2 NEW_DOP=4 and STATE_BACKEND_TYPE=rocks > STATE_BACKEND_FILE_ASYNC=true STATE_BACKEND_ROCKSDB_INCREMENTAL=true > SIMULATE_FAILURE=false ... > 2020-03-25T06:50:46.1754645Z Job (b8cb04e4b1e730585bc616aa352866d0) is > running. > 2020-03-25T06:50:46.1758132Z Waiting for job > (b8cb04e4b1e730585bc616aa352866d0) to have at least 1 completed checkpoints > ... > 2020-03-25T06:50:46.3478276Z Waiting for job to process up to 200 records, > current progress: 173 records ... > 2020-03-25T06:50:49.6332988Z Cancelling job b8cb04e4b1e730585bc616aa352866d0. > 2020-03-25T06:50:50.4875673Z Cancelled job b8cb04e4b1e730585bc616aa352866d0. > 2020-03-25T06:50:50.5468230Z ls: cannot access > '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304/externalized-chckpt-e2e-backend-dir/b8cb04e4b1e730585bc616aa352866d0/chk-[1-9]*/_metadata': > No such file or directory > 2020-03-25T06:50:50.5606260Z Restoring job with externalized checkpoint at . > ... > 2020-03-25T06:50:58.4728245Z > 2020-03-25T06:50:58.4732663Z > ------------------------------------------------------------ > 2020-03-25T06:50:58.4735785Z The program finished with the following > exception: > 2020-03-25T06:50:58.4737759Z > 2020-03-25T06:50:58.4742666Z > org.apache.flink.client.program.ProgramInvocationException: The main method > caused an error: java.util.concurrent.ExecutionException: > org.apache.flink.runtime.client.JobSubmissionException: Failed to submit > JobGraph. 
> 2020-03-25T06:50:58.4746274Z at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335) > 2020-03-25T06:50:58.4749954Z at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205) > 2020-03-25T06:50:58.4752753Z at > org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:142) > 2020-03-25T06:50:58.4755400Z at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:659) > 2020-03-25T06:50:58.4757862Z at > org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210) > 2020-03-25T06:50:58.4760282Z at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:890) > 2020-03-25T06:50:58.4763591Z at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:963) > 2020-03-25T06:50:58.4764274Z at > java.security.AccessController.doPrivileged(Native Method) > 2020-03-25T06:50:58.4764809Z at > javax.security.auth.Subject.doAs(Subject.java:422) > 2020-03-25T06:50:58.4765434Z at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) > 2020-03-25T06:50:58.4766180Z at > org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > 2020-03-25T06:50:58.4773549Z at > org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:963) > 2020-03-25T06:50:58.4774502Z Caused by: java.lang.RuntimeException: > java.util.concurrent.ExecutionException: > org.apache.flink.runtime.client.JobSubmissionException: Failed to submit > JobGraph. 
> 2020-03-25T06:50:58.4775382Z at > org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:276) > 2020-03-25T06:50:58.4776163Z at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1741) > 2020-03-25T06:50:58.4777706Z at > org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:90) > 2020-03-25T06:50:58.4778334Z at > org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:58) > 2020-03-25T06:50:58.4779007Z at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1620) > 2020-03-25T06:50:58.4779654Z at > org.apache.flink.streaming.tests.DataStreamAllroundTestProgram.main(DataStreamAllroundTestProgram.java:215) > 2020-03-25T06:50:58.4780371Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-03-25T06:50:58.4784367Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-03-25T06:50:58.4785063Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-03-25T06:50:58.4785557Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-03-25T06:50:58.4786204Z at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321) > 2020-03-25T06:50:58.4786547Z ... 11 more > 2020-03-25T06:50:58.4787007Z Caused by: > java.util.concurrent.ExecutionException: > org.apache.flink.runtime.client.JobSubmissionException: Failed to submit > JobGraph. 
> 2020-03-25T06:50:58.4787717Z at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > 2020-03-25T06:50:58.4788203Z at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > 2020-03-25T06:50:58.4788835Z at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1736) > 2020-03-25T06:50:58.4789362Z ... 20 more > 2020-03-25T06:50:58.4789720Z Caused by: > org.apache.flink.runtime.client.JobSubmissionException: Failed to submit > JobGraph. > 2020-03-25T06:50:58.4790467Z at > org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:359) > 2020-03-25T06:50:58.4791087Z at > java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) > 2020-03-25T06:50:58.4791650Z at > java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) > 2020-03-25T06:50:58.4792560Z at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > 2020-03-25T06:50:58.4793617Z at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > 2020-03-25T06:50:58.4794496Z at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:274) > 2020-03-25T06:50:58.4795255Z at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > 2020-03-25T06:50:58.4796264Z at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > 2020-03-25T06:50:58.4796867Z at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > 2020-03-25T06:50:58.4797439Z at > java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) > 2020-03-25T06:50:58.4798000Z at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:943) > 2020-03-25T06:50:58.4798589Z at > 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > 2020-03-25T06:50:58.4799162Z at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > 2020-03-25T06:50:58.4799727Z at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > 2020-03-25T06:50:58.4800210Z at java.lang.Thread.run(Thread.java:748) > 2020-03-25T06:50:58.4800767Z Caused by: > org.apache.flink.runtime.rest.util.RestClientException: [Internal server > error., <Exception on server side: > 2020-03-25T06:50:58.4801351Z > org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job. > 2020-03-25T06:50:58.4801938Z at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$internalSubmitJob$3(Dispatcher.java:336) > 2020-03-25T06:50:58.4803660Z at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > 2020-03-25T06:50:58.4804555Z at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > 2020-03-25T06:50:58.4805235Z at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > 2020-03-25T06:50:58.4805839Z at > akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > 2020-03-25T06:50:58.4806515Z at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) > 2020-03-25T06:50:58.4807184Z at > akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > 2020-03-25T06:50:58.4807807Z at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > 2020-03-25T06:50:58.4808417Z at > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > 2020-03-25T06:50:58.4809055Z at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 2020-03-25T06:50:58.4809783Z Caused by: java.lang.RuntimeException: > org.apache.flink.runtime.client.JobExecutionException: Could not set up > JobManager > 2020-03-25T06:50:58.4810756Z at 
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36) > 2020-03-25T06:50:58.4811444Z at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) > 2020-03-25T06:50:58.4811937Z ... 6 more > 2020-03-25T06:50:58.4812414Z Caused by: > org.apache.flink.runtime.client.JobExecutionException: Could not set up > JobManager > 2020-03-25T06:50:58.4813330Z at > org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:152) > 2020-03-25T06:50:58.4814154Z at > org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84) > 2020-03-25T06:50:58.4814846Z at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:379) > 2020-03-25T06:50:58.4815622Z at > org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34) > 2020-03-25T06:50:58.4816074Z ... 7 more > 2020-03-25T06:50:58.4816924Z Caused by: java.io.IOException: Cannot access > file system for checkpoint/savepoint path 'file://.'. 
> 2020-03-25T06:50:58.4817673Z at > org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:233) > 2020-03-25T06:50:58.4818450Z at > org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:110) > 2020-03-25T06:50:58.4819276Z at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1312) > 2020-03-25T06:50:58.4819943Z at > org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:314) > 2020-03-25T06:50:58.4820633Z at > org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:247) > 2020-03-25T06:50:58.4821258Z at > org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:223) > 2020-03-25T06:50:58.4821862Z at > org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:118) > 2020-03-25T06:50:58.4822505Z at > org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:103) > 2020-03-25T06:50:58.4823115Z at > org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:282) > 2020-03-25T06:50:58.4823665Z at > org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:270) > 2020-03-25T06:50:58.4824485Z at > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98) > 2020-03-25T06:50:58.4825597Z at > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40) > 2020-03-25T06:50:58.4826400Z at > org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146) > 2020-03-25T06:50:58.4826919Z ... 10 more > 2020-03-25T06:50:58.4829018Z Caused by: java.io.IOException: Found local file > path with authority '.' 
in path 'file://.'. Hint: Did you forget a slash? > (correct path would be 'file:///.') > 2020-03-25T06:50:58.4829875Z at > org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:441) > 2020-03-25T06:50:58.4830364Z at > org.apache.flink.core.fs.FileSystem.get(FileSystem.java:389) > 2020-03-25T06:50:58.4830807Z at > org.apache.flink.core.fs.Path.getFileSystem(Path.java:292) > 2020-03-25T06:50:58.4831408Z at > org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:230) > 2020-03-25T06:50:58.4832021Z ... 22 more > 2020-03-25T06:50:58.4832151Z > 2020-03-25T06:50:58.4832356Z End of exception on server side>] > 2020-03-25T06:50:58.4832720Z at > org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:390) > 2020-03-25T06:50:58.4833238Z at > org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:374) > 2020-03-25T06:50:58.4833884Z at > java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:966) > 2020-03-25T06:50:58.4834376Z at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) > 2020-03-25T06:50:58.4834724Z ... 4 more > 2020-03-25T06:50:58.5042321Z Resuming from externalized checkpoint job could > not be started. > 2020-03-25T06:50:58.5044210Z [FAIL] Test script contains errors. > 2020-03-25T06:50:58.5052826Z Checking of logs skipped. > 2020-03-25T06:50:58.5053164Z > 2020-03-25T06:50:58.5054116Z [FAIL] 'Resuming Externalized Checkpoint (rocks, > incremental, scale up) end-to-end test' failed after 0 minutes and 27 > seconds! Test exited with exit code 1 > 2020-03-25T06:50:58.5054639Z > 2020-03-25T06:50:58.8067813Z Stopping taskexecutor daemon (pid: 86888) on > host fv-az655. > 2020-03-25T06:50:59.0257270Z Stopping standalonesession daemon (pid: 86603) > on host fv-az655. > 2020-03-25T06:50:59.4920994Z > 2020-03-25T06:50:59.5000014Z ##[error]Bash exited with code '1'. 
> 2020-03-25T06:50:59.5015374Z ##[section]Finishing: Run e2e tests > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
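The root cause in the stack trace above (a checkpoint path of {{file://.}} rejected with "Found local file path with authority '.'") can be reproduced with plain java.net.URI: with two slashes, everything up to the next "/" is parsed as the URI authority (a host), so "." becomes a bogus authority and the path is empty; with three slashes, the authority is empty and "." stays part of the path. A minimal illustration:

```java
import java.net.URI;

// Demonstrates why "file://." fails while "file:///." works: URI syntax
// treats the segment after "//" as the authority component, not the path.
public class FileUriAuthority {
    public static void main(String[] args) {
        URI twoSlashes = URI.create("file://.");
        // "." lands in the authority; the path is empty
        System.out.println("authority=" + twoSlashes.getAuthority()
                + " path=" + twoSlashes.getPath());

        URI threeSlashes = URI.create("file:///.");
        // empty authority (reported as null); "." stays in the path
        System.out.println("authority=" + threeSlashes.getAuthority()
                + " path=" + threeSlashes.getPath());
    }
}
```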