[ 
https://issues.apache.org/jira/browse/FLINK-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997994#comment-15997994
 ] 

ASF GitHub Bot commented on FLINK-6364:
---------------------------------------

Github user StefanRRichter commented on the issue:

    https://github.com/apache/flink/pull/3801
  
    I am sorry, but before merging I noticed that some tests (e.g. 
`RocksDBStateBackendTest.testCancelRunningSnapshot`) fail sporadically (only on 
Travis). I tracked the problem down and I think the cause is that `cancel()` 
does not eagerly close the streams to interrupt blocking IO calls.
    
    I suggest the following fix:
    
    `RocksDBIncrementalSnapshotOperation` should have its own 
`CloseableRegistry`. This tracks all the open streams inside the checkpointing 
and is registered with the backend's registry for as long as the task runs. 
Then, in `cancel()`, as a first step we can close and unregister that inner 
`CloseableRegistry`. This also prevents races in which the current stream gets 
closed asynchronously by `cancel()` while the checkpointing has actually already 
opened the next stream (the registry closes and blocks new streams on 
registration once it is closed).
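
A minimal sketch of the suggested pattern, to make the intended behaviour concrete. `SimpleCloseableRegistry` and `SnapshotOperationSketch` are hypothetical stand-ins written for illustration; they are not Flink's `CloseableRegistry` or `RocksDBIncrementalSnapshotOperation`, and the method names are assumptions.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified registry; NOT Flink's actual CloseableRegistry. */
final class SimpleCloseableRegistry implements Closeable {
    private final Set<Closeable> open = new HashSet<>();
    private boolean closed;

    /** Tracks a resource. Once the registry is closed, the resource is closed and rejected. */
    public void register(Closeable resource) throws IOException {
        synchronized (open) {
            if (!closed) {
                open.add(resource);
                return;
            }
        }
        resource.close();
        throw new IOException("Registry is already closed; resource was closed on registration.");
    }

    /** Stops tracking a resource without closing it. */
    public boolean unregister(Closeable resource) {
        synchronized (open) {
            return open.remove(resource);
        }
    }

    /** Closes every tracked resource and blocks all further registrations. */
    @Override
    public void close() throws IOException {
        List<Closeable> toClose;
        synchronized (open) {
            if (closed) {
                return;
            }
            closed = true;
            toClose = new ArrayList<>(open);
            open.clear();
        }
        IOException first = null;
        for (Closeable c : toClose) {
            try {
                c.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}

/** Hypothetical snapshot operation that owns an inner registry for its streams. */
final class SnapshotOperationSketch {
    private final SimpleCloseableRegistry backendRegistry;
    private final SimpleCloseableRegistry innerRegistry = new SimpleCloseableRegistry();

    SnapshotOperationSketch(SimpleCloseableRegistry backendRegistry) throws IOException {
        this.backendRegistry = backendRegistry;
        // The inner registry is itself a Closeable tracked by the backend's registry.
        backendRegistry.register(innerRegistry);
    }

    /** Every stream opened during checkpointing goes through the inner registry. */
    void registerStream(Closeable stream) throws IOException {
        innerRegistry.register(stream);
    }

    /** First step of cancellation: close all open streams, then detach from the backend. */
    void cancel() throws IOException {
        innerRegistry.close();
        backendRegistry.unregister(innerRegistry);
    }
}
```

With this structure, a stream that the checkpointing thread tries to register after `cancel()` is closed immediately and rejected, so it cannot escape the cancellation.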


> Implement incremental checkpointing in RocksDBStateBackend
> ----------------------------------------------------------
>
>                 Key: FLINK-6364
>                 URL: https://issues.apache.org/jira/browse/FLINK-6364
>             Project: Flink
>          Issue Type: Sub-task
>          Components: State Backends, Checkpointing
>            Reporter: Xiaogang Shi
>            Assignee: Xiaogang Shi
>
> {{RocksDBStateBackend}} is well suited for incremental checkpointing because 
> RocksDB is based on LSM trees, which record updates in new sst files and all 
> sst files are immutable. By only materializing those new sst files, we can 
> significantly improve the performance of checkpointing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
