I hava already tested it.

[root@node ~]#ll
/mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0038/
total 32
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
blobStore-273cf1a6-0f98-4c86-801e-5d76fef66a58
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
blobStore-992562a5-f42f-43f7-90de-a415b4dcd398
drwx--x---  4 yarn hadoop 4096 Dec  8 02:29
container_e73_1544101169829_0038_01_000059
drwx--x--- 13 yarn hadoop 4096 Dec  8 02:29 filecache
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
flink-dist-cache-6d8dab0c-4034-4bbe-a9b9-b524cf6856e3
drwxr-xr-x  8 yarn hadoop 4096 Dec  8 02:29
flink-io-6fba8471-4d84-4c13-9e3c-ef3891b366f0
drwxr-xr-x  4 yarn hadoop 4096 Dec  8 02:29 localState
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
rocksdb-lib-7ef4471db8d3b8c1bdcfa4dba4d95a36

And the derectory "flink-io-6fba8471-4d84-4c13-9e3c-ef3891b366f0" does not
exist.

[root@node ~]#ll
/mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0038/
total 12
drwx--x--- 13 yarn hadoop 4096 Dec  8 02:29 filecache
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:53 localState
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:53
rocksdb-lib-7ef4471db8d3b8c1bdcfa4dba4d95a36

Ben Yan <yan.xiao.bin.m...@gmail.com> 于2018年12月8日周六 上午12:23写道:

> Thank you for your advice! I will check this out next, and I will sync the
> information at any time with new progress.
>
> Stefan Richter <s.rich...@data-artisans.com> 于2018年12月8日周六 上午12:05写道:
>
>> I think then you need to investigate what goes wrong
>> in RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath. If
>> you look at the code it lists the files in a directory and tries to hard
>> link them into another directory, and I would only expect to see the
>> mentioned exception if the original file that we try to link does not
>> exist. However, imo it must exist because we list it in the directory right
>> before the link attempt and Flink is not delete anything in the meantime.
>> So the question is, why can a file that was listed before just suddenly
>> disappear when it is hard linked? The only potential problem could be in
>> the path transformations and concatenations, but they look good to me and
>> also pass all tests, including end-to-end tests that do exactly such a
>> restore. I suggest to either observe the created files and what happens
>> with the one that is mentioned in the exception or introduce debug logging
>> in the code, in particular a check if the listed file (the link target)
>> does exist before linking, which it should in my opinion because it is
>> listed in the directory.
>>
>> On 7. Dec 2018, at 16:33, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote:
>>
>> The version of the recovered checkpoint is also 1.7.0 .
>>
>> Stefan Richter <s.rich...@data-artisans.com> 于2018年12月7日周五 下午11:06写道:
>>
>>> Just to clarify, the checkpoint from which you want to resume in 1.7,
>>> was that taken by 1.6 or by 1.7? So far this is a bit mysterious because it
>>> says FileNotFound, but the whole iteration is driven by listing the
>>> existing files. Can you somehow monitor which files and directories are
>>> created during the restore attempt?
>>>
>>> On 7. Dec 2018, at 15:53, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote:
>>>
>>> hi ,Stefan
>>>
>>> Thank you for your explanation. I used flink1.6.2, which is without any
>>> problems. I have tested it a few times with version 1.7.0, but every time I
>>> resume from the checkpoint, the job will show the exception I showed
>>> earlier, which will make the job unrecoverable.And I checked all the logs,
>>> except for this exception, there are no other exceptions.
>>>
>>> The following is all the logs when an exception occurs:
>>>
>>> 2018-12-06 22:53:41,282 INFO  
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
>>> KeyedProcess (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from 
>>> DEPLOYING to RUNNING.
>>>
>>> 2018-12-06 22:53:41,285 INFO  
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
>>> KeyedProcess (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from 
>>> RUNNING to FAILED.
>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>>     at 
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>>     at 
>>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>>     at 
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>>     at 
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
>>> state backend for 
>>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of 
>>> the 1 provided restore options.
>>>     at 
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>>     at 
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>>     at 
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>>     ... 5 more
>>> Caused by: java.nio.file.NoSuchFileException: 
>>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst
>>>  -> 
>>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>>>     at 
>>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>>     at 
>>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>>     at java.nio.file.Files.createLink(Files.java:1086)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>>     at 
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>>     at 
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>>     ... 7 more
>>> 2018-12-06 22:53:41,286 INFO  
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job 
>>> Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state 
>>> RUNNING to FAILING.
>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>>     at 
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>>     at 
>>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>>     at 
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>>     at 
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
>>> state backend for 
>>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of 
>>> the 1 provided restore options.
>>>     at 
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>>     at 
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>>     at 
>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>>     ... 5 more
>>> Caused by: java.nio.file.NoSuchFileException: 
>>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst
>>>  -> 
>>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>>>     at 
>>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>>     at 
>>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>>     at java.nio.file.Files.createLink(Files.java:1086)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>>     at 
>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>>     at 
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>>     at 
>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>>     ... 7 more
>>> 2018-12-06 22:53:41,287 INFO  
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: 
>>> topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING 
>>> to CANCELING.
>>>
>>>
>>>
>>> Best,
>>> Ben
>>>
>>> Stefan Richter <s.rich...@data-artisans.com> 于2018年12月7日周五 下午10:00写道:
>>>
>>>> Hi,
>>>>
>>>> From what I can see in the log here, it looks like your RocksDB is not
>>>> recovering from local but from a remote filesystem. This recovery basically
>>>> has steps:
>>>>
>>>> 1: Create a temporary directory (in your example, this is the dir that
>>>> ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files,
>>>> mainly sst files from remote fs to the temporary directory in local fs.
>>>>
>>>> 2: List all the downloaded files in the temporary directory and either
>>>> hardlink (for sst files) or copy (for all other files) the listed files
>>>> into the new RocksDb instance path (the path that ends with …/db)
>>>>
>>>> 3: Open the new db from the instance path, delete the temporary
>>>> directory.
>>>>
>>>> Now what is very surprising here is that it claims some file was not
>>>> found (not clear which one, but I assume the downloaded file). However, how
>>>> the file can be lost between downloading/listing and the attempt to
>>>> hardlink it is very mysterious. Can you check the logs for any other
>>>> exceptions and can you check what files exist in the recovery (e.g. what is
>>>> downloaded, if the instance path is there, …). For now, I cannot see how a
>>>> listed file could suddenly disappear, Flink will only delete the temporary
>>>> directory if recovery is completed or failed.
>>>>
>>>> Also: is this problem deterministic or was this a singularity? Did you
>>>> use a different Flink version before (which worked)?
>>>>
>>>> Best,
>>>> Stefan
>>>>
>>>> On 7. Dec 2018, at 11:28, Ben Yan <yan.xiao.bin.m...@gmail.com> wrote:
>>>>
>>>> hi . I am using flink-1.7.0. I am using RockDB and hdfs as
>>>> statebackend, but recently I found the following exception when the job
>>>> resumed from the checkpoint. Task-local state is always considered a
>>>> secondary copy, the ground truth of the checkpoint state is the primary
>>>> copy in the distributed store. But it seems that the job did not
>>>> recover from hdfs, and it failed directly.Hope someone can give me
>>>> advices or hints about the problem that I encountered.
>>>>
>>>>
>>>> 2018-12-06 22:54:04,171 INFO  
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - 
>>>> KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from 
>>>> RUNNING to FAILED.
>>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>>>    at 
>>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>>>    at 
>>>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>>>    at 
>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>>>    at 
>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>>>    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>>>    at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
>>>> state backend for 
>>>> KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of 
>>>> the 1 provided restore options.
>>>>    at 
>>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>>>    at 
>>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>>>    at 
>>>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>>>    ... 5 more
>>>> Caused by: java.nio.file.NoSuchFileException: 
>>>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst
>>>>  -> 
>>>> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>>>>    at 
>>>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>>>    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>>>    at 
>>>> sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>>>    at java.nio.file.Files.createLink(Files.java:1086)
>>>>    at 
>>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>>>    at 
>>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>>>    at 
>>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>>>    at 
>>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>>>    at 
>>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>>>    at 
>>>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>>>    at 
>>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>>>    at 
>>>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>>>    ... 7 more
>>>>
>>>>
>>>> Best
>>>>
>>>> Ben
>>>>
>>>>
>>>>
>>>
>>

Reply via email to