I hava already tested it. [root@node ~]#ll /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0038/ total 32 drwxr-xr-x 2 yarn hadoop 4096 Dec 8 02:29 blobStore-273cf1a6-0f98-4c86-801e-5d76fef66a58 drwxr-xr-x 2 yarn hadoop 4096 Dec 8 02:29 blobStore-992562a5-f42f-43f7-90de-a415b4dcd398 drwx--x--- 4 yarn hadoop 4096 Dec 8 02:29 container_e73_1544101169829_0038_01_000059 drwx--x--- 13 yarn hadoop 4096 Dec 8 02:29 filecache drwxr-xr-x 2 yarn hadoop 4096 Dec 8 02:29 flink-dist-cache-6d8dab0c-4034-4bbe-a9b9-b524cf6856e3 drwxr-xr-x 8 yarn hadoop 4096 Dec 8 02:29 flink-io-6fba8471-4d84-4c13-9e3c-ef3891b366f0 drwxr-xr-x 4 yarn hadoop 4096 Dec 8 02:29 localState drwxr-xr-x 2 yarn hadoop 4096 Dec 8 02:29 rocksdb-lib-7ef4471db8d3b8c1bdcfa4dba4d95a36
And the derectory "flink-io-6fba8471-4d84-4c13-9e3c-ef3891b366f0" does not exist. [root@node ~]#ll /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0038/ total 12 drwx--x--- 13 yarn hadoop 4096 Dec 8 02:29 filecache drwxr-xr-x 2 yarn hadoop 4096 Dec 8 02:53 localState drwxr-xr-x 2 yarn hadoop 4096 Dec 8 02:53 rocksdb-lib-7ef4471db8d3b8c1bdcfa4dba4d95a36 Ben Yan <yan.xiao.bin.m...@gmail.com> 于2018年12月8日周六 上午12:23写道: > Thank you for your advice! I will check this out next, and I will sync the > information at any time with new progress. > > Stefan Richter <s.rich...@data-artisans.com> 于2018年12月8日周六 上午12:05写道: > >> I think then you need to investigate what goes wrong >> in RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath. If >> you look at the code it lists the files in a directory and tries to hard >> link them into another directory, and I would only expect to see the >> mentioned exception if the original file that we try to link does not >> exist. However, imo it must exist because we list it in the directory right >> before the link attempt and Flink is not delete anything in the meantime. >> So the question is, why can a file that was listed before just suddenly >> disappear when it is hard linked? The only potential problem could be in >> the path transformations and concatenations, but they look good to me and >> also pass all tests, including end-to-end tests that do exactly such a >> restore. I suggest to either observe the created files and what happens >> with the one that is mentioned in the exception or introduce debug logging >> in the code, in particular a check if the listed file (the link target) >> does exist before linking, which it should in my opinion because it is >> listed in the directory. >> >> On 7. Dec 2018, at 16:33, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote: >> >> The version of the recovered checkpoint is also 1.7.0 . >> >> Stefan Richter <s.rich...@data-artisans.com> 于2018年12月7日周五 下午11:06写道: >> >>> Just to clarify, the checkpoint from which you want to resume in 1.7, >>> was that taken by 1.6 or by 1.7? So far this is a bit mysterious because it >>> says FileNotFound, but the whole iteration is driven by listing the >>> existing files. Can you somehow monitor which files and directories are >>> created during the restore attempt? >>> >>> On 7. Dec 2018, at 15:53, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote: >>> >>> hi ,Stefan >>> >>> Thank you for your explanation. I used flink1.6.2, which is without any >>> problems. I have tested it a few times with version 1.7.0, but every time I >>> resume from the checkpoint, the job will show the exception I showed >>> earlier, which will make the job unrecoverable.And I checked all the logs, >>> except for this exception, there are no other exceptions. >>> >>> The following is all the logs when an exception occurs: >>> >>> 2018-12-06 22:53:41,282 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> KeyedProcess (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from >>> DEPLOYING to RUNNING. >>> >>> 2018-12-06 22:53:41,285 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> KeyedProcess (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from >>> RUNNING to FAILED. >>> java.lang.Exception: Exception while creating StreamOperatorStateContext. >>> at >>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195) >>> at >>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250) >>> at >>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738) >>> at >>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289) >>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704) >>> at java.lang.Thread.run(Thread.java:748) >>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed >>> state backend for >>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of >>> the 1 provided restore options. >>> at >>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137) >>> at >>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284) >>> at >>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135) >>> ... 5 more >>> Caused by: java.nio.file.NoSuchFileException: >>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst >>> -> >>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst >>> at >>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) >>> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) >>> at >>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) >>> at java.nio.file.Files.createLink(Files.java:1086) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147) >>> at >>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151) >>> at >>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123) >>> ... 7 more >>> 2018-12-06 22:53:41,286 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job >>> Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state >>> RUNNING to FAILING. >>> java.lang.Exception: Exception while creating StreamOperatorStateContext. >>> at >>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195) >>> at >>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250) >>> at >>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738) >>> at >>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289) >>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704) >>> at java.lang.Thread.run(Thread.java:748) >>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed >>> state backend for >>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of >>> the 1 provided restore options. >>> at >>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137) >>> at >>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284) >>> at >>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135) >>> ... 5 more >>> Caused by: java.nio.file.NoSuchFileException: >>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst >>> -> >>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst >>> at >>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) >>> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) >>> at >>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) >>> at java.nio.file.Files.createLink(Files.java:1086) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147) >>> at >>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151) >>> at >>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123) >>> ... 7 more >>> 2018-12-06 22:53:41,287 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: >>> topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING >>> to CANCELING. >>> >>> >>> >>> Best, >>> Ben >>> >>> Stefan Richter <s.rich...@data-artisans.com> 于2018年12月7日周五 下午10:00写道: >>> >>>> Hi, >>>> >>>> From what I can see in the log here, it looks like your RocksDB is not >>>> recovering from local but from a remote filesystem. This recovery basically >>>> has steps: >>>> >>>> 1: Create a temporary directory (in your example, this is the dir that >>>> ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files, >>>> mainly sst files from remote fs to the temporary directory in local fs. >>>> >>>> 2: List all the downloaded files in the temporary directory and either >>>> hardlink (for sst files) or copy (for all other files) the listed files >>>> into the new RocksDb instance path (the path that ends with …/db) >>>> >>>> 3: Open the new db from the instance path, delete the temporary >>>> directory. >>>> >>>> Now what is very surprising here is that it claims some file was not >>>> found (not clear which one, but I assume the downloaded file). However, how >>>> the file can be lost between downloading/listing and the attempt to >>>> hardlink it is very mysterious. Can you check the logs for any other >>>> exceptions and can you check what files exist in the recovery (e.g. what is >>>> downloaded, if the instance path is there, …). For now, I cannot see how a >>>> listed file could suddenly disappear, Flink will only delete the temporary >>>> directory if recovery is completed or failed. >>>> >>>> Also: is this problem deterministic or was this a singularity? Did you >>>> use a different Flink version before (which worked)? >>>> >>>> Best, >>>> Stefan >>>> >>>> On 7. Dec 2018, at 11:28, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote: >>>> >>>> hi . I am using flink-1.7.0. I am using RockDB and hdfs as >>>> statebackend, but recently I found the following exception when the job >>>> resumed from the checkpoint. Task-local state is always considered a >>>> secondary copy, the ground truth of the checkpoint state is the primary >>>> copy in the distributed store. But it seems that the job did not >>>> recover from hdfs, and it failed directly.Hope someone can give me >>>> advices or hints about the problem that I encountered. >>>> >>>> >>>> 2018-12-06 22:54:04,171 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>>> KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from >>>> RUNNING to FAILED. >>>> java.lang.Exception: Exception while creating StreamOperatorStateContext. >>>> at >>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195) >>>> at >>>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250) >>>> at >>>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738) >>>> at >>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289) >>>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704) >>>> at java.lang.Thread.run(Thread.java:748) >>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed >>>> state backend for >>>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of >>>> the 1 provided restore options. >>>> at >>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137) >>>> at >>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284) >>>> at >>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135) >>>> ... 5 more >>>> Caused by: java.nio.file.NoSuchFileException: >>>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst >>>> -> >>>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst >>>> at >>>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) >>>> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) >>>> at >>>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) >>>> at java.nio.file.Files.createLink(Files.java:1086) >>>> at >>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238) >>>> at >>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186) >>>> at >>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916) >>>> at >>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864) >>>> at >>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525) >>>> at >>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147) >>>> at >>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151) >>>> at >>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123) >>>> ... 7 more >>>> >>>> >>>> Best >>>> >>>> Ben >>>> >>>> >>>> >>> >>