Re: Missing state in RocksDB checkpoints

2019-04-24 Thread Ning Shi
Till, Thank you for escalating this to a blocker. I agree that data loss is always a serious issue. For reference, the workaround is to unchain the stateful operators. To allow the new job to recover from the previous checkpoint, we also had to change the UID of the operator that was missing

Re: Missing state in RocksDB checkpoints

2019-04-24 Thread Till Rohrmann
Thanks for reporting this issue, Ning. I think this is actually a blocker for the next release and should be fixed right away. For future reference, here is the issue [1]. I've also pulled in Stefan, who knows these components very well. [1] https://issues.apache.org/jira/browse/FLINK-12296 Cheers,

Re: Missing state in RocksDB checkpoints

2019-04-23 Thread Ning Shi
On Tue, 23 Apr 2019 10:53:52 -0400, Congxian Qiu wrote: > Sorry for the confusion. In the previous email, I just wanted to say that the > problem is not caused by the UUID generation; it is caused by different > operators sharing the same directory (because Flink currently uses JobVertex as > the direc

Re: Missing state in RocksDB checkpoints

2019-04-23 Thread Congxian Qiu
Hi Ning, Sorry for the confusion. In the previous email, I just wanted to say that the problem is not caused by the UUID generation; it is caused by different operators sharing the same directory (because Flink currently uses JobVertex as the directory). Best, Congxian On Apr 23, 2019, 19:41 +0800, Ning

Re: Missing state in RocksDB checkpoints

2019-04-23 Thread Ning Shi
Congxian, We just did a test. Unchaining the two stateful operators seems to have worked around the problem. The state of both operators is successfully saved in the checkpoint. Ning On Tue, Apr 23, 2019 at 7:41 AM Ning Shi wrote: > > Congxian, > > Thank you for creating the tick

Re: Missing state in RocksDB checkpoints

2019-04-23 Thread Ning Shi
Congxian, Thank you for creating the ticket and providing the relevant code. I’m curious why you don’t think the directory collision is a problem. What we observe is that the state of one of the operators is not included in the checkpoint and data is lost on restore. That’s a pretty serious probl

Re: Missing state in RocksDB checkpoints

2019-04-22 Thread Congxian Qiu
Hi Ning, From the log message you gave, the two operators share the same directory, and when taking a snapshot, the directory is deleted first if it exists (RocksIncrementalSnapshotStrategy#prepareLocalSnapshotDirectory). I did not find an existing issue for this problem, and I don’t think this is a problem
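The collision described in this message can be illustrated with a small, self-contained sketch. It is not Flink's code: the class SnapshotDirCollision, its methods, and the operator names "op-a" and "op-b" are hypothetical, and the only behavior taken from the thread is "delete the snapshot directory first if it exists". Under that assumption, two operators that resolve to the same directory leave only the second operator's state behind, while giving each operator its own directory keeps both.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SnapshotDirCollision {

    // Hypothetical stand-in for the "delete the directory first if it exists" step.
    static void prepareLocalSnapshotDirectory(Path dir) throws IOException {
        if (Files.exists(dir)) {
            try (Stream<Path> paths = Files.walk(dir)) {
                // Reverse order so that files are deleted before their parent directory.
                List<Path> toDelete = paths.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
                for (Path p : toDelete) {
                    Files.delete(p);
                }
            }
        }
        Files.createDirectories(dir);
    }

    // Hypothetical stand-in for one operator writing its local snapshot.
    static void snapshot(Path dir, String operatorName) throws IOException {
        prepareLocalSnapshotDirectory(dir);
        Files.write(dir.resolve(operatorName + ".state"),
                ("state of " + operatorName).getBytes(StandardCharsets.UTF_8));
    }

    // Returns the names of the state files that exist after both operators snapshot.
    static List<String> survivingStates(boolean shareDirectory) {
        try {
            Path base = Files.createTempDirectory("chk");
            // Shared case: both operators use one directory (assumed to be per vertex).
            Path dirA = base.resolve(shareDirectory ? "vertex" : "op-a");
            Path dirB = base.resolve(shareDirectory ? "vertex" : "op-b");
            snapshot(dirA, "op-a");
            snapshot(dirB, "op-b"); // in the shared case this wipes op-a's file
            try (Stream<Path> files = Files.walk(base)) {
                return files.filter(Files::isRegularFile)
                        .map(p -> p.getFileName().toString())
                        .sorted()
                        .collect(Collectors.toList());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("shared directory:   " + survivingStates(true));
        System.out.println("separate directory: " + survivingStates(false));
    }
}
```

In the shared case only op-b.state remains, which matches the symptom reported in this thread (one operator's state missing from the checkpoint). The separate-directory case only loosely mirrors why unchaining the operators helped; the actual directory layout in Flink may differ.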

Missing state in RocksDB checkpoints

2019-04-22 Thread Ning Shi
We have a Flink job using the RocksDB state backend. We found that the state of one of the RichMapFunction operators was not being saved in checkpoints or savepoints. After some digging, it seems that two operators in the same operator chain are colliding with each other during a checkpoint or savepoint, resulting in one