Hi, Ning

From the log message you given, the two operate share the same directory, and 
when snapshot, the directory will be deleted first if it 
exists(RocksIncrementalSnapshotStrategy#prepareLocalSnapshotDirectory).

I did not find an issue for this problem, and I don’t thinks this is a problem 
of UUID generation problem, please check the path generation logic in 
LocalRecoveryDirectoryProviderImpl#subtaskSpecificCheckpointDirectory.

I’ve created an issue for this problem.

Best, Congxian
On Apr 23, 2019, 11:19 +0800, Ning Shi <nings...@gmail.com>, wrote:
> We have a Flink job using RocksDB state backend. We found that one of the
> RichMapFunction state was not being saved in checkpoints or savepoints. After
> some digging, it seems that two operators in the same operator chain are
> colliding with each other during checkpoint or savepoint, resulting in one of
> the operator's state to be missing.
>
> I extracted all the checkpoint directory for all operators from the RocksDB 
> LOG
> files for one of the checkpoints. As you can see, the StreamMap operator 
> shared
> the same checkpoint directory with the CoBroadcastWithKeyedOperator. They are 
> in
> the same operator chain.
>
> /var/flink/data/localState/aid_AllocationID{37a99d74a8e452ff06257c61ab13a3c8}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_1/chk_21244/rocks_db
>  CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__2_90__
> /var/flink/data/localState/aid_AllocationID{37a99d74a8e452ff06257c61ab13a3c8}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_53/chk_21244/rocks_db
>  CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__54_90__
> /var/flink/data/localState/aid_AllocationID{37a99d74a8e452ff06257c61ab13a3c8}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_53/chk_21244/rocks_db
>  StreamMap_3c5866a6cc097b462de842b2ef91910d__54_90__
> /var/flink/data/localState/aid_AllocationID{37a99d74a8e452ff06257c61ab13a3c8}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_77/chk_21244/rocks_db
>  WindowOperator_bc2936094388a70852534bd6c0fce178__78_90__
> /var/flink/data/localState/aid_AllocationID{5cde66b8a81c5202f7685928bb18ac00}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_84/chk_21244/rocks_db
>  CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__85_90__
> /var/flink/data/localState/aid_AllocationID{5cde66b8a81c5202f7685928bb18ac00}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_66/chk_21244/rocks_db
>  CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__67_90__
> /var/flink/data/localState/aid_AllocationID{5cde66b8a81c5202f7685928bb18ac00}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_66/chk_21244/rocks_db
>  StreamMap_3c5866a6cc097b462de842b2ef91910d__67_90__
> /var/flink/data/localState/aid_AllocationID{5cde66b8a81c5202f7685928bb18ac00}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_15/chk_21244/rocks_db
>  WindowOperator_bc2936094388a70852534bd6c0fce178__16_90__
> /var/flink/data/localState/aid_AllocationID{61cf4a285199ab779ec85784980d15e2}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_53/chk_21244/rocks_db
>  CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__54_90__
> /var/flink/data/localState/aid_AllocationID{61cf4a285199ab779ec85784980d15e2}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_1/chk_21244/rocks_db
>  CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__2_90__
> /var/flink/data/localState/aid_AllocationID{61cf4a285199ab779ec85784980d15e2}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_1/chk_21244/rocks_db
>  StreamMap_3c5866a6cc097b462de842b2ef91910d__2_90__
> /var/flink/data/localState/aid_AllocationID{61cf4a285199ab779ec85784980d15e2}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_56/chk_21244/rocks_db
>  WindowOperator_bc2936094388a70852534bd6c0fce178__57_90__
> /var/flink/data/localState/aid_AllocationID{a93df5bf51a4f9b3673cd18b46abbecb}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_30/chk_21244/rocks_db
>  CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__31_90__
> /var/flink/data/localState/aid_AllocationID{a93df5bf51a4f9b3673cd18b46abbecb}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_46/chk_21244/rocks_db
>  CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__47_90__
> /var/flink/data/localState/aid_AllocationID{a93df5bf51a4f9b3673cd18b46abbecb}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_46/chk_21244/rocks_db
>  StreamMap_3c5866a6cc097b462de842b2ef91910d__47_90__
> /var/flink/data/localState/aid_AllocationID{a93df5bf51a4f9b3673cd18b46abbecb}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_12/chk_21244/rocks_db
>  WindowOperator_bc2936094388a70852534bd6c0fce178__13_90__
> /var/flink/data/localState/aid_AllocationID{f6241daa33001250c3f2934a8ba6b506}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_46/chk_21244/rocks_db
>  CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__47_90__
> /var/flink/data/localState/aid_AllocationID{f6241daa33001250c3f2934a8ba6b506}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_30/chk_21244/rocks_db
>  CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__31_90__
> /var/flink/data/localState/aid_AllocationID{f6241daa33001250c3f2934a8ba6b506}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_30/chk_21244/rocks_db
>  StreamMap_3c5866a6cc097b462de842b2ef91910d__31_90__
> /var/flink/data/localState/aid_AllocationID{f6241daa33001250c3f2934a8ba6b506}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_79/chk_21244/rocks_db
>  WindowOperator_bc2936094388a70852534bd6c0fce178__80_90__
> /var/flink/data/localState/aid_AllocationID{fbf10f2769de14f7328de6aa3c056515}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_66/chk_21244/rocks_db
>  CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__67_90__
> /var/flink/data/localState/aid_AllocationID{fbf10f2769de14f7328de6aa3c056515}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_84/chk_21244/rocks_db
>  CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__85_90__
> /var/flink/data/localState/aid_AllocationID{fbf10f2769de14f7328de6aa3c056515}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_84/chk_21244/rocks_db
>  StreamMap_3c5866a6cc097b462de842b2ef91910d__85_90__
> /var/flink/data/localState/aid_AllocationID{fbf10f2769de14f7328de6aa3c056515}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_58/chk_21244/rocks_db
>  WindowOperator_bc2936094388a70852534bd6c0fce178__59_90__
>
> After each checkpoint, when I checked the checkpoint directory for the 
> StreamMap
> operator state, the SST files are not there. Restoring a new job from the same
> checkpoint or savepoint also confirmed that the StreamMap states were missing,
> but with no error reported by Flink.
>
> I also used strace to capture file I/O during checkpoints. I could see that 
> the
> StreamMap operator succeeded in creating the checkpoint directory, but
> immediately after that it received a lot of "-1 ENOENT (No such file or
> directory)" errors, possibly because the directory was over-written by the 
> other
> operator.
>
> Is this an known issue? It seems that the UUID generation of chained operators
> are not differentiating the two operators, resulting in data loss.
>
> Thanks,
>
> Ning

Reply via email to