Hi everyone,
We have a Flink application with a very large and perhaps unusual state. The basic shape of it is a very large and somewhat random keyed-stream partition space, each partition holding a continuously growing map state keyed by microsecond-timestamp Long values. There are never any overwrites in the map state, which is monotonic per partition key. We chose map state over list state in the hope that we can manage a sliding window using TTL. With RocksDB incremental checkpointing the app runs very well despite the large total checkpoint size; our current checkpoint size is 3.2 TB.

We have several questions around space amplification when using the RocksDB backend, and I'm wondering if anyone can suggest or confirm answers. (Sketches of the configurations I refer to in 1 and 2 follow after the list.)

1. Using LEVEL compaction we have not seen any decrease in total checkpoint size through TTL compaction. To test the TTL, I cut the period from 60 to 30 days (we have well over 60 days of processing time), enabled cleanupFullSnapshot(), and ran a test job with incremental checkpointing disabled. After multiple full checkpoints and a NATIVE savepoint the size was unchanged. Could it be that RocksDB compaction reclaims nothing because we never update key values? The state consists almost entirely of key space. Do keys not get freed by the RocksDB compaction filter for TTL?

2. I'm wondering if FIFO compaction is a solution to the above. To move to it we would first need to take a canonical savepoint, then redeploy with RocksDB/FIFO. That should work, but will doing so "reset the clock" for the TTL? Given the nature of our state, though, I am leaning toward this as our only option.

3. Rescaling is a problem because of https://issues.apache.org/jira/browse/FLINK-34050, whose fix is not yet released. Because of this bug the checkpoint size grows more than proportionally to the rescaling: for example, going from 44 slots to 60 takes the checkpoint from 3.2 TB to 4.9 TB. Until 1.19.1 is released we could cherry-pick the fix and build our own Docker image, or would restoring from a canonical savepoint, as described above, sidestep this bug?

If anyone can help with any insights, please do!
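For concreteness, here is roughly the TTL setup I mean in (1). This is a minimal sketch rather than our exact code: the "Event" value type, the state name, and the compaction-filter query interval are placeholders.

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.time.Time;

    // Expire entries 30 days after creation, never serving expired values.
    StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.days(30))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            // TTL filter invoked during RocksDB compaction; the argument is how many
            // entries are processed before the current timestamp is refreshed.
            .cleanupInRocksdbCompactFilter(1000L)
            // Applies only to full, self-contained snapshots, not incremental checkpoints.
            .cleanupFullSnapshot()
            .build();

    // Map state keyed by microsecond timestamps, as described above.
    MapStateDescriptor<Long, Event> descriptor =
            new MapStateDescriptor<>("events", Long.class, Event.class);
    descriptor.enableTimeToLive(ttlConfig);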
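For (2), my understanding is that the compaction style would be switched through a custom RocksDBOptionsFactory along these lines. Again a sketch under assumptions: I have not verified how FIFO compaction interacts with Flink's column families or with the TTL compaction filter.

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
    import org.rocksdb.ColumnFamilyOptions;
    import org.rocksdb.CompactionStyle;
    import org.rocksdb.DBOptions;

    import java.util.Collection;

    // Sketch: switch every column family from LEVEL to FIFO compaction.
    public class FifoOptionsFactory implements RocksDBOptionsFactory {

        @Override
        public DBOptions createDBOptions(
                DBOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
            return currentOptions; // leave DB-level options untouched
        }

        @Override
        public ColumnFamilyOptions createColumnOptions(
                ColumnFamilyOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
            return currentOptions.setCompactionStyle(CompactionStyle.FIFO);
        }
    }

Wired up when building the job (or, I believe, equivalently via state.backend.rocksdb.options-factory in the configuration):

    EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend(true); // incremental
    backend.setRocksDBOptions(new FifoOptionsFactory());
    env.setStateBackend(backend);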
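As for taking the canonical savepoint itself: if I remember correctly, since Flink 1.15 (FLIP-203) the CLI lets you choose the savepoint format explicitly, something like:

    # Stop the job with a canonical (backend-independent) savepoint; syntax from memory,
    # bucket and job id are placeholders.
    bin/flink stop --type canonical --savepointPath s3://<bucket>/savepoints <jobId>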