Till,
Thank you for escalating this to a blocker. I agree that data loss is always a
serious issue.
For reference, the workaround is to unchain the stateful operators. To allow the
new job to recover from the previous checkpoint, we also had to change the
UID of the operator that was missing
Thanks for reporting this issue, Ning. I think this is actually a blocker
for the next release and should be fixed right away. For future reference,
here is the issue [1].
I've also pulled in Stefan who knows these components very well.
[1] https://issues.apache.org/jira/browse/FLINK-12296
Cheers,
On Tue, 23 Apr 2019 10:53:52 -0400,
Congxian Qiu wrote:
> Sorry for the confusion in my previous email. I just wanted to say that the
> problem is not caused by the UUID generation; it is caused by different
> operators sharing the same directory (because Flink currently uses the
> JobVertex as the direc
Hi Ning,
Sorry for the confusion in my previous email. I just wanted to say that the
problem is not caused by the UUID generation; it is caused by different operators
sharing the same directory (because Flink currently uses the JobVertex as the
directory).
Best, Congxian
On Apr 23, 2019, 19:41 +0800, Ning
Congxian,
We just did a test. Separating the two stateful operators from
chaining seems to have worked around the problem. The states for both
of them are successfully saved in the checkpoint.
Ning
On Tue, Apr 23, 2019 at 7:41 AM Ning Shi wrote:
>
> Congxian,
>
> Thank you for creating the tick
Congxian,
Thank you for creating the ticket and providing the relevant code. I’m curious
why you don’t think the directory collision is a problem. What we observe
is that the state of one of the operators is not included in the checkpoint and
data is lost on restore. That’s a pretty serious probl
Hi, Ning
From the log message you gave, the two operators share the same directory, and
when a snapshot is taken, the directory is deleted first if it exists
(RocksIncrementalSnapshotStrategy#prepareLocalSnapshotDirectory).
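To make the collision described above concrete, here is a minimal, self-contained sketch in plain Java. It is not Flink code: the class name, the directory names ("vertex-1", "vertex-2") and the ".state" files are all made up for illustration. It only assumes the behaviour reported in this thread, namely that each operator deletes the snapshot directory if it exists before writing, and that chained operators end up with the same directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SnapshotDirCollisionDemo {

    // Hypothetical stand-in for the "delete the directory first if it exists" step.
    static void prepareLocalSnapshotDirectory(Path dir) throws IOException {
        if (Files.exists(dir)) {
            List<Path> toDelete;
            try (Stream<Path> walk = Files.walk(dir)) {
                // Reverse order so files are deleted before their parent directory.
                toDelete = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path p : toDelete) {
                Files.delete(p);
            }
        }
        Files.createDirectories(dir);
    }

    // Each operator prepares the directory, then writes its own state file into it.
    static void snapshot(Path dir, String operatorName) throws IOException {
        prepareLocalSnapshotDirectory(dir);
        Files.write(dir.resolve(operatorName + ".state"),
                operatorName.getBytes(StandardCharsets.UTF_8));
    }

    // Returns the names of the state files left after both operators have snapshotted.
    public static List<String> survivingStateFiles(boolean chained) {
        try {
            Path root = Files.createTempDirectory("snapshot-demo");
            // Assumption: chained operators share one directory, unchained ones do not.
            Path dirA = root.resolve("vertex-1");
            Path dirB = root.resolve(chained ? "vertex-1" : "vertex-2");
            snapshot(dirA, "operatorA");
            snapshot(dirB, "operatorB");

            List<String> files = new ArrayList<>();
            try (Stream<Path> walk = Files.walk(root)) {
                walk.filter(Files::isRegularFile)
                        .forEach(p -> files.add(p.getFileName().toString()));
            }
            Collections.sort(files);
            return files;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("chained:   " + survivingStateFiles(true));
        System.out.println("unchained: " + survivingStateFiles(false));
    }
}
```

In the chained case only the second operator's file is left, because its prepare step wipes what the first operator wrote. With separate directories both files survive, which matches what unchaining the operators achieved in the test reported in this thread.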
I did not find an issue for this problem, and I don’t think this is a problem
We have a Flink job using the RocksDB state backend. We found that the state of
one of the RichMapFunctions was not being saved in checkpoints or savepoints.
After some digging, it seems that two operators in the same operator chain are
colliding with each other during checkpoint or savepoint, resulting in one