[ https://issues.apache.org/jira/browse/FLINK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Anderson resolved FLINK-11997.
------------------------------------
    Resolution: Duplicate

> ConcurrentModificationException: ZooKeeper unexpectedly modified
> ----------------------------------------------------------------
>
>                 Key: FLINK-11997
>                 URL: https://issues.apache.org/jira/browse/FLINK-11997
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.8.0
>         Environment: Flink 1.8.0-rc4, running in a k8s job cluster with
> checkpointing and savepointing in minio. Zookeeper enabled, also saving to
> minio.
> jobmanager.rpc.address: localhost
> jobmanager.rpc.port: 6123
> jobmanager.heap.size: 1024m
> taskmanager.heap.size: 1024m
> taskmanager.numberOfTaskSlots: 4
> parallelism.default: 4
> high-availability: zookeeper
> high-availability.jobmanager.port: 6123
> high-availability.storageDir: s3://highavailability/storage
> high-availability.zookeeper.quorum: zoo1:2181
> state.backend: filesystem
> state.checkpoints.dir: s3://state/checkpoints
> state.savepoints.dir: s3://state/savepoints
> rest.port: 8081
> zookeeper.sasl.disable: true
> s3.access-key: minio
> s3.secret-key: minio123
> s3.path-style-access: true
> s3.endpoint: http://minio-service:9000
>
>            Reporter: David Anderson
>            Priority: Major
>         Attachments: FAILURE
>
>
> Trying to rescale a job running in a k8s job cluster via
> {{flink modify 00000000000000000000000000000000 -p 2 -m localhost:30081}}
> Rescaling works fine if HA is off. Taking a savepoint and restarting from one
> also works fine, even with HA turned on. But rescaling by modifying the job
> with HA on always fails as shown below:
> Caused by: org.apache.flink.util.FlinkException: Failed to rescale the job 00000000000000000000000000000000.
>     ... 21 more
> Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmaster.exceptions.JobModificationException: Could not restore from temporary rescaling savepoint. This might indicate that the savepoint s3://state/savepoints/savepoint-000000-2fa7fd5dabb2 got corrupted. Deleting this savepoint as a precaution.
>     at org.apache.flink.runtime.jobmaster.JobMaster.lambda$rescaleOperators$4(JobMaster.java:470)
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>     ... 18 more
> Caused by: org.apache.flink.runtime.jobmaster.exceptions.JobModificationException: Could not restore from temporary rescaling savepoint. This might indicate that the savepoint s3://state/savepoints/savepoint-000000-2fa7fd5dabb2 got corrupted. Deleting this savepoint as a precaution.
>     at org.apache.flink.runtime.jobmaster.JobMaster.lambda$restoreExecutionGraphFromRescalingSavepoint$18(JobMaster.java:1433)
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException: ZooKeeper unexpectedly modified
>     at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:159)
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.addCheckpoint(ZooKeeperCompletedCheckpointStore.java:216)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1106)
>     at org.apache.flink.runtime.jobmaster.JobMaster.tryRestoreExecutionGraphFromSavepoint(JobMaster.java:1251)
>     at org.apache.flink.runtime.jobmaster.JobMaster.lambda$restoreExecutionGraphFromRescalingSavepoint$18(JobMaster.java:1413)
>     ... 10 more
> Caused by: org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
>     at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
>     at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1006)
>     at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44)
>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129)
>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125)
>     at org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122)
>     at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:153)
>     ... 14 more

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)