We have a Flink job using the RocksDB state backend. We found that the state of one of our RichMapFunction operators was not being saved in checkpoints or savepoints. After some digging, it seems that two operators in the same operator chain collide with each other during a checkpoint or savepoint, so that the state of one of them goes missing.
I extracted the checkpoint directories of all operators from the RocksDB LOG files for one of the checkpoints. As you can see, the StreamMap operator shares its checkpoint directory with the CoBroadcastWithKeyedOperator. They are in the same operator chain.

/var/flink/data/localState/aid_AllocationID{37a99d74a8e452ff06257c61ab13a3c8}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_1/chk_21244/rocks_db CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__2_90__
/var/flink/data/localState/aid_AllocationID{37a99d74a8e452ff06257c61ab13a3c8}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_53/chk_21244/rocks_db CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__54_90__
/var/flink/data/localState/aid_AllocationID{37a99d74a8e452ff06257c61ab13a3c8}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_53/chk_21244/rocks_db StreamMap_3c5866a6cc097b462de842b2ef91910d__54_90__
/var/flink/data/localState/aid_AllocationID{37a99d74a8e452ff06257c61ab13a3c8}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_77/chk_21244/rocks_db WindowOperator_bc2936094388a70852534bd6c0fce178__78_90__
/var/flink/data/localState/aid_AllocationID{5cde66b8a81c5202f7685928bb18ac00}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_84/chk_21244/rocks_db CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__85_90__
/var/flink/data/localState/aid_AllocationID{5cde66b8a81c5202f7685928bb18ac00}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_66/chk_21244/rocks_db CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__67_90__
/var/flink/data/localState/aid_AllocationID{5cde66b8a81c5202f7685928bb18ac00}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_66/chk_21244/rocks_db StreamMap_3c5866a6cc097b462de842b2ef91910d__67_90__
/var/flink/data/localState/aid_AllocationID{5cde66b8a81c5202f7685928bb18ac00}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_15/chk_21244/rocks_db WindowOperator_bc2936094388a70852534bd6c0fce178__16_90__
/var/flink/data/localState/aid_AllocationID{61cf4a285199ab779ec85784980d15e2}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_53/chk_21244/rocks_db CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__54_90__
/var/flink/data/localState/aid_AllocationID{61cf4a285199ab779ec85784980d15e2}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_1/chk_21244/rocks_db CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__2_90__
/var/flink/data/localState/aid_AllocationID{61cf4a285199ab779ec85784980d15e2}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_1/chk_21244/rocks_db StreamMap_3c5866a6cc097b462de842b2ef91910d__2_90__
/var/flink/data/localState/aid_AllocationID{61cf4a285199ab779ec85784980d15e2}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_56/chk_21244/rocks_db WindowOperator_bc2936094388a70852534bd6c0fce178__57_90__
/var/flink/data/localState/aid_AllocationID{a93df5bf51a4f9b3673cd18b46abbecb}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_30/chk_21244/rocks_db CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__31_90__
/var/flink/data/localState/aid_AllocationID{a93df5bf51a4f9b3673cd18b46abbecb}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_46/chk_21244/rocks_db CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__47_90__
/var/flink/data/localState/aid_AllocationID{a93df5bf51a4f9b3673cd18b46abbecb}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_46/chk_21244/rocks_db StreamMap_3c5866a6cc097b462de842b2ef91910d__47_90__
/var/flink/data/localState/aid_AllocationID{a93df5bf51a4f9b3673cd18b46abbecb}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_12/chk_21244/rocks_db WindowOperator_bc2936094388a70852534bd6c0fce178__13_90__
/var/flink/data/localState/aid_AllocationID{f6241daa33001250c3f2934a8ba6b506}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_46/chk_21244/rocks_db CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__47_90__
/var/flink/data/localState/aid_AllocationID{f6241daa33001250c3f2934a8ba6b506}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_30/chk_21244/rocks_db CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__31_90__
/var/flink/data/localState/aid_AllocationID{f6241daa33001250c3f2934a8ba6b506}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_30/chk_21244/rocks_db StreamMap_3c5866a6cc097b462de842b2ef91910d__31_90__
/var/flink/data/localState/aid_AllocationID{f6241daa33001250c3f2934a8ba6b506}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_79/chk_21244/rocks_db WindowOperator_bc2936094388a70852534bd6c0fce178__80_90__
/var/flink/data/localState/aid_AllocationID{fbf10f2769de14f7328de6aa3c056515}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_54b7f0cbe350c274d942032aa504dbdd_sti_66/chk_21244/rocks_db CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__67_90__
/var/flink/data/localState/aid_AllocationID{fbf10f2769de14f7328de6aa3c056515}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_84/chk_21244/rocks_db CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__85_90__
/var/flink/data/localState/aid_AllocationID{fbf10f2769de14f7328de6aa3c056515}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_567adb020dcc57a12c17bd43c00b0f55_sti_84/chk_21244/rocks_db StreamMap_3c5866a6cc097b462de842b2ef91910d__85_90__
/var/flink/data/localState/aid_AllocationID{fbf10f2769de14f7328de6aa3c056515}/jid_6241b30b0adb82bd50cd5d37aa6128d1/vtx_bc2936094388a70852534bd6c0fce178_sti_58/chk_21244/rocks_db WindowOperator_bc2936094388a70852534bd6c0fce178__59_90__

After each checkpoint, when I checked the checkpoint directory for the StreamMap operator state, the SST files were not there. Restoring a new job from the same checkpoint or savepoint also confirmed that the StreamMap state was missing, with no error reported by Flink.

I also used strace to capture file I/O during checkpoints. I could see that the StreamMap operator succeeded in creating the checkpoint directory, but immediately after that it received a lot of "-1 ENOENT (No such file or directory)" errors, possibly because the directory was overwritten by the other operator.

Is this a known issue? It seems that the UUID generation for chained operators does not differentiate between the two operators, resulting in data loss.

Thanks,
Ning
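
P.S. In case anyone wants to run the same check on their own job, here is a minimal sketch of how the shared directories can be spotted. It assumes the directory/operator pairs have already been pulled out of the RocksDB LOG files into lines of the form "<checkpoint directory> <operator identifier>", as in the listing above; it does not parse the LOG files itself. The function name find_shared_dirs is just something I made up for illustration, and the sample uses three of the pairs from the listing.

```python
from collections import defaultdict

def find_shared_dirs(lines):
    """Return {directory: [operators]} for directories used by more than one operator.

    Each input line is assumed to hold exactly two whitespace-separated
    fields: the checkpoint directory and the operator identifier.
    """
    by_dir = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        directory, operator = parts
        by_dir[directory].add(operator)
    return {d: sorted(ops) for d, ops in by_dir.items() if len(ops) > 1}

# Three pairs copied from the listing for checkpoint 21244.
BASE = ("/var/flink/data/localState/"
        "aid_AllocationID{37a99d74a8e452ff06257c61ab13a3c8}/"
        "jid_6241b30b0adb82bd50cd5d37aa6128d1/")
SAMPLE = [
    BASE + "vtx_54b7f0cbe350c274d942032aa504dbdd_sti_1/chk_21244/rocks_db "
           "CoStreamFlatMap_54b7f0cbe350c274d942032aa504dbdd__2_90__",
    BASE + "vtx_567adb020dcc57a12c17bd43c00b0f55_sti_53/chk_21244/rocks_db "
           "CoBroadcastWithKeyedOperator_567adb020dcc57a12c17bd43c00b0f55__54_90__",
    BASE + "vtx_567adb020dcc57a12c17bd43c00b0f55_sti_53/chk_21244/rocks_db "
           "StreamMap_3c5866a6cc097b462de842b2ef91910d__54_90__",
]

if __name__ == "__main__":
    for directory, operators in find_shared_dirs(SAMPLE).items():
        print(directory)
        for op in operators:
            print("    " + op)
```

With the full listing as input, every directory that this reports is one where two operators write their checkpoint files to the same place.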