Thank you for your advice! I will check this out next, and I will sync the information at any time with new progress.
Stefan Richter <s.rich...@data-artisans.com> 于2018年12月8日周六 上午12:05写道: > I think then you need to investigate what goes wrong > in RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath. If > you look at the code it lists the files in a directory and tries to hard > link them into another directory, and I would only expect to see the > mentioned exception if the original file that we try to link does not > exist. However, imo it must exist because we list it in the directory right > before the link attempt and Flink is not delete anything in the meantime. > So the question is, why can a file that was listed before just suddenly > disappear when it is hard linked? The only potential problem could be in > the path transformations and concatenations, but they look good to me and > also pass all tests, including end-to-end tests that do exactly such a > restore. I suggest to either observe the created files and what happens > with the one that is mentioned in the exception or introduce debug logging > in the code, in particular a check if the listed file (the link target) > does exist before linking, which it should in my opinion because it is > listed in the directory. > > On 7. Dec 2018, at 16:33, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote: > > The version of the recovered checkpoint is also 1.7.0 . > > Stefan Richter <s.rich...@data-artisans.com> 于2018年12月7日周五 下午11:06写道: > >> Just to clarify, the checkpoint from which you want to resume in 1.7, was >> that taken by 1.6 or by 1.7? So far this is a bit mysterious because it >> says FileNotFound, but the whole iteration is driven by listing the >> existing files. Can you somehow monitor which files and directories are >> created during the restore attempt? >> >> On 7. Dec 2018, at 15:53, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote: >> >> hi ,Stefan >> >> Thank you for your explanation. I used flink1.6.2, which is without any >> problems. I have tested it a few times with version 1.7.0, but every time I >> resume from the checkpoint, the job will show the exception I showed >> earlier, which will make the job unrecoverable.And I checked all the logs, >> except for this exception, there are no other exceptions. >> >> The following is all the logs when an exception occurs: >> >> 2018-12-06 22:53:41,282 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - KeyedProcess >> (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from DEPLOYING to >> RUNNING. >> >> 2018-12-06 22:53:41,285 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - KeyedProcess >> (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from RUNNING to FAILED. >> java.lang.Exception: Exception while creating StreamOperatorStateContext. >> at >> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195) >> at >> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289) >> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704) >> at java.lang.Thread.run(Thread.java:748) >> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed >> state backend for >> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of >> the 1 provided restore options. >> at >> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137) >> at >> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284) >> at >> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135) >> ... 5 more >> Caused by: java.nio.file.NoSuchFileException: >> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst >> -> >> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst >> at >> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) >> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) >> at >> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) >> at java.nio.file.Files.createLink(Files.java:1086) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147) >> at >> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151) >> at >> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123) >> ... 7 more >> 2018-12-06 22:53:41,286 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job >> Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state >> RUNNING to FAILING. >> java.lang.Exception: Exception while creating StreamOperatorStateContext. >> at >> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195) >> at >> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289) >> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704) >> at java.lang.Thread.run(Thread.java:748) >> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed >> state backend for >> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of >> the 1 provided restore options. >> at >> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137) >> at >> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284) >> at >> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135) >> ... 5 more >> Caused by: java.nio.file.NoSuchFileException: >> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst >> -> >> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst >> at >> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) >> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) >> at >> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) >> at java.nio.file.Files.createLink(Files.java:1086) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525) >> at >> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147) >> at >> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151) >> at >> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123) >> ... 7 more >> 2018-12-06 22:53:41,287 INFO >> org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: >> topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING >> to CANCELING. >> >> >> >> Best, >> Ben >> >> Stefan Richter <s.rich...@data-artisans.com> 于2018年12月7日周五 下午10:00写道: >> >>> Hi, >>> >>> From what I can see in the log here, it looks like your RocksDB is not >>> recovering from local but from a remote filesystem. This recovery basically >>> has steps: >>> >>> 1: Create a temporary directory (in your example, this is the dir that >>> ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files, >>> mainly sst files from remote fs to the temporary directory in local fs. >>> >>> 2: List all the downloaded files in the temporary directory and either >>> hardlink (for sst files) or copy (for all other files) the listed files >>> into the new RocksDb instance path (the path that ends with …/db) >>> >>> 3: Open the new db from the instance path, delete the temporary >>> directory. >>> >>> Now what is very surprising here is that it claims some file was not >>> found (not clear which one, but I assume the downloaded file). However, how >>> the file can be lost between downloading/listing and the attempt to >>> hardlink it is very mysterious. Can you check the logs for any other >>> exceptions and can you check what files exist in the recovery (e.g. what is >>> downloaded, if the instance path is there, …). For now, I cannot see how a >>> listed file could suddenly disappear, Flink will only delete the temporary >>> directory if recovery is completed or failed. >>> >>> Also: is this problem deterministic or was this a singularity? Did you >>> use a different Flink version before (which worked)? >>> >>> Best, >>> Stefan >>> >>> On 7. Dec 2018, at 11:28, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote: >>> >>> hi . I am using flink-1.7.0. I am using RockDB and hdfs as statebackend, >>> but recently I found the following exception when the job resumed from the >>> checkpoint. Task-local state is always considered a secondary copy, the >>> ground truth of the checkpoint state is the primary copy in the distributed >>> store. But it seems that the job did not recover from hdfs, and it >>> failed directly.Hope someone can give me advices or hints about the >>> problem that I encountered. >>> >>> >>> 2018-12-06 22:54:04,171 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - >>> KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from >>> RUNNING to FAILED. >>> java.lang.Exception: Exception while creating StreamOperatorStateContext. >>> at >>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195) >>> at >>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250) >>> at >>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738) >>> at >>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289) >>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704) >>> at java.lang.Thread.run(Thread.java:748) >>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed >>> state backend for >>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of >>> the 1 provided restore options. >>> at >>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137) >>> at >>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284) >>> at >>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135) >>> ... 5 more >>> Caused by: java.nio.file.NoSuchFileException: >>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst >>> -> >>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst >>> at >>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) >>> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) >>> at >>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) >>> at java.nio.file.Files.createLink(Files.java:1086) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525) >>> at >>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147) >>> at >>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151) >>> at >>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123) >>> ... 7 more >>> >>> >>> Best >>> >>> Ben >>> >>> >>> >> >