Hello,

We have a job whose main purpose is to track whether we've previously seen
a particular event - that's it. If an event is new, we save it to an
external database; if we've seen it before, we block the write. There's a
3-day TTL on the state to manage its size. The downstream DB can tolerate
new data slipping through and will reject the duplicate write - we mainly
use the state to reduce writes.
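
For concreteness, the dedup step is roughly the following (a simplified
sketch - SeenFilter is a stand-in name, we key directly by event id, and
String stands in for our record type):

    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Dedup on first sight: forward an event id only the first time we see it.
    public class SeenFilter extends KeyedProcessFunction<String, String, String> {

        private transient ValueState<Boolean> seen;

        @Override
        public void open(Configuration parameters) {
            // 3-day TTL bounds the state size; expired entries are dropped
            // lazily by RocksDB's compaction filter.
            StateTtlConfig ttl = StateTtlConfig
                    .newBuilder(Time.days(3))
                    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                    .cleanupInRocksdbCompactFilter(1000)
                    .build();

            ValueStateDescriptor<Boolean> desc =
                    new ValueStateDescriptor<>("seen", Boolean.class);
            desc.enableTimeToLive(ttl);
            seen = getRuntimeContext().getState(desc);
        }

        @Override
        public void processElement(String eventId, Context ctx, Collector<String> out)
                throws Exception {
            if (seen.value() == null) {   // never seen (or expired): let the write through
                seen.update(true);
                out.collect(eventId);
            }
            // otherwise: drop the event, suppressing the duplicate write
        }
    }

wired up as stream.keyBy(id -> id).process(new SeenFilter()).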

We're starting to see performance issues, even after adding 50% more
capacity to the job. After some number of days or weeks it eventually goes
into constant backpressure. I'm wondering if there's something we can do to
improve efficiency.

1. According to the flamegraph, 60-70% of the time is spent in RocksDB.get
2. The state is just a ValueState<Boolean>. I assume this is the
smallest/most efficient state. The keyBy is extremely high cardinality -
are we better off with a lower-cardinality key and a
MapState<String, Boolean>.contains() check? (A sketch of what I mean
follows this list.)
3. Current configs: taskmanager.memory.process.size: 4g,
taskmanager.memory.managed.fraction: 0.8 (increased from 0.6; didn't see
much change). Also shown as a config snippet after this list.
4. Estimated number of keys tops out somewhere around 9-10B, with
estimated live data size somewhere around 250 GB. Attempting to switch to
heap state immediately ran into OOMs (parallelism 120, 8 GB memory per
TaskManager).
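
To make point 2 concrete, here's roughly the variant I'm imagining - the
bucket count of 4096 is arbitrary and BucketedSeenFilter is just an
illustrative name, with the same 3-day TTL applied per map entry:

    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Variant: key by a hash bucket instead of the raw event id, and keep
    // per-event flags in MapState, probing with contains() instead of get().
    public class BucketedSeenFilter extends KeyedProcessFunction<Integer, String, String> {

        private transient MapState<String, Boolean> seen;

        @Override
        public void open(Configuration parameters) {
            StateTtlConfig ttl = StateTtlConfig
                    .newBuilder(Time.days(3))
                    .cleanupInRocksdbCompactFilter(1000)
                    .build();
            MapStateDescriptor<String, Boolean> desc =
                    new MapStateDescriptor<>("seen", String.class, Boolean.class);
            desc.enableTimeToLive(ttl);   // TTL is tracked per map entry
            seen = getRuntimeContext().getMapState(desc);
        }

        @Override
        public void processElement(String eventId, Context ctx, Collector<String> out)
                throws Exception {
            if (!seen.contains(eventId)) {
                seen.put(eventId, true);
                out.collect(eventId);
            }
        }
    }

keyed as events.keyBy(id -> Math.floorMod(id.hashCode(), 4096)). My
understanding is that with the RocksDB backend each map entry is stored
under its own composite key, so contains() is still a point lookup - happy
to be corrected if that means this buys nothing.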
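
And for reference, point 3 as a config snippet (state.backend: rocksdb is
implied by the flamegraph rather than copied from our config):

    state.backend: rocksdb                     # implied by the RocksDB.get frames
    taskmanager.memory.process.size: 4g
    taskmanager.memory.managed.fraction: 0.8   # raised from 0.6; little change observed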

And perhaps the answer is just "scale out" :) but if there are signals
that indicate when we've reached the limit of the current scale, it'd be
great to know what to look for!

Thanks!
Trystan
