B. Micheal Okutubo created SPARK-51717:
------------------------------------------

             Summary: Possible SST mismatch error for the second snapshot created for a new query
                 Key: SPARK-51717
                 URL: https://issues.apache.org/jira/browse/SPARK-51717
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 4.0.0, 4.1.0
            Reporter: B. Micheal Okutubo


This is an edge case in SST file reuse. It can only happen for the first ever 
RocksDB checkpoint, in the following sequence:
 # The first ever RocksDB checkpoint (e.g. for version 10) was created with 
x.sst, but has not yet been uploaded by maintenance.

 # The next batch using RocksDB at v10 fails and rolls back the store to -1 
(which invalidates RocksDB).

 # A new request to load RocksDB at v10 comes in, but the v10 checkpoint is 
still not uploaded, so we have to replay the changelog starting from 
checkpoint v0.

 # We create a new v11 and a new checkpoint with a new x*.sst. Meanwhile, v10 
is uploaded by maintenance. Then, during the upload of x*.sst for v11, we reuse 
the DFS file for x.sst, thinking it is the same as x*.sst.

The problem comes from step 3: the way the file manager loads v0 is different 
from how it loads other versions. When loading other versions, whenever we 
delete an existing local file we also remove it from the file mapping. But for 
v0, the file manager just deletes the local dir, and we missed clearing the 
file mapping in this case. Hence the old x.sst was still present in the file 
mapping at step 4. We need to fix this and also add an additional size check.
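
The sequence can be sketched with a toy model. This is only an illustration, 
not Spark's actual file manager: ToyFileManager, resolve_for_upload, 
load_version_zero and the file sizes (100 and 120 bytes) are all hypothetical, 
and the model assumes the mapping entry for x.sst is recorded when the first 
checkpoint is created. It shows how a stale mapping entry leads to reusing the 
wrong DFS file, and how either clearing the mapping on a v0 load or comparing 
sizes avoids the reuse in this scenario.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class DfsFile:
    """A file in DFS, as remembered by the (toy) file mapping."""
    name: str
    size_bytes: int


@dataclass
class ToyFileManager:
    """Hypothetical stand-in for the file manager; not Spark's real API."""
    clear_mapping_on_v0: bool  # the fix described above
    check_size: bool           # the additional size check
    mapping: dict = field(default_factory=dict)  # local SST name -> DfsFile
    _ids: count = field(default_factory=lambda: count(1))

    def resolve_for_upload(self, local_name, size_bytes):
        """Return (dfs_file, reused) for a local SST file in a checkpoint."""
        known = self.mapping.get(local_name)
        if known is not None and (
            not self.check_size or known.size_bytes == size_bytes
        ):
            return known, True  # reuse the DFS file already in the mapping
        fresh = DfsFile(f"{next(self._ids)}-{local_name}", size_bytes)
        self.mapping[local_name] = fresh
        return fresh, False

    def load_version_zero(self):
        """Loading v0 wipes the local dir; the fix also clears the mapping."""
        if self.clear_mapping_on_v0:
            self.mapping.clear()


def second_snapshot_reuses_stale_file(clear_mapping_on_v0, check_size):
    fm = ToyFileManager(clear_mapping_on_v0, check_size)
    fm.resolve_for_upload("x.sst", 100)  # step 1: first checkpoint (v10)
    fm.load_version_zero()               # steps 2-3: rollback, reload from v0
    # Step 4: v11 checkpoint has a different file under the same local name.
    _, reused = fm.resolve_for_upload("x.sst", 120)
    return reused


if __name__ == "__main__":
    print(second_snapshot_reuses_stale_file(False, False))  # buggy behaviour
    print(second_snapshot_reuses_stale_file(True, False))   # mapping cleared
    print(second_snapshot_reuses_stale_file(False, True))   # size check only
```

In this sketch the size check alone would not catch two different files that 
happen to have the same size, which is one reason to treat it as an extra 
safeguard on top of clearing the mapping rather than a replacement for it.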

 

This can only happen when using changelog checkpointing.



