Hi Everyone,

We have hit a roadblock moving an app to production scale and were hoping to
get some guidance. The application is a pretty common use case in stream
processing, but it does require maintaining a large number of keyed states. We
are processing two streams: one is a daily burst (normally around 50 million
events, but it could go up to 100 million in a one-hour burst) and the other is
a constant stream of around 70-80 million events per hour. We are doing a
low-level join between the two keyed streams using a CoProcessFunction. The
function needs to refresh (upsert) state from the daily burst stream and
decorate the constantly streaming data with values from the state built from
the bursty stream. All of the logic is working pretty
well in a standalone Dev environment, where we are throwing about 500k events
of bursty traffic at it for state and about 2-3 million events of the data
stream. There we have 1 TM with 16 GB memory, 1 JM with 8 GB memory, and 16
slots (1 per core) on the server. We have been taking savepoints in case we
need to restart the app for code changes etc., and the app does seem to recover
from state very well. Based on the savepoints, the total volume of state in the
production flow should be around 25-30 GB.
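
To make the join logic concrete, here is a minimal plain-Python sketch of what
the function does per key. It is not Flink code: a dict stands in for keyed
state, and the names KeyedJoin, on_burst, on_data and the "ref" field are made
up purely for illustration.

```python
from typing import Any, Dict


class KeyedJoin:
    """Toy model of the two-input keyed join; a dict stands in for keyed state."""

    def __init__(self) -> None:
        # key -> reference values built from the daily burst stream
        self.state: Dict[str, Dict[str, Any]] = {}

    def on_burst(self, key: str, values: Dict[str, Any]) -> None:
        # Upsert: create the entry if the key is new, otherwise overwrite
        # only the fields carried by this burst record.
        self.state.setdefault(key, {}).update(values)

    def on_data(self, key: str, event: Dict[str, Any]) -> Dict[str, Any]:
        # Decorate: attach a copy of whatever state exists for the key
        # (an empty dict if the key has not been seen on the burst side).
        return {**event, "ref": dict(self.state.get(key, {}))}


join = KeyedJoin()
join.on_burst("k1", {"tier": "gold"})
join.on_burst("k1", {"region": "eu"})      # second record upserts the same key
decorated = join.on_data("k1", {"amount": 10})
```

In the real job the state lives in Flink keyed state rather than a single dict,
so the sketch only illustrates the per-key semantics, not the memory behavior.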

At this point, however, we are trying to deploy the app at production scale.
The app also has a flag that can be set at startup to ignore the data stream so
that we can simply initialize state. So basically we are trying to see if we
can initialize the state first and take a savepoint as a test. We are using 10
TMs with 4 slots and 8 GB memory each (the idea was to allocate around 3 times
the estimated state size to start with), but the TMs keep getting killed by
YARN with a "GC overhead limit exceeded" error. We have gone through quite a
few blogs/docs on Flink managed memory, off-heap vs. heap memory, disk
spill-over, state backends etc. We did try to tweak the managed-memory configs
in multiple ways (off/on heap, fraction, network buffers etc.) but can’t seem
to figure out a good way to fine-tune the app to avoid these issues. Ideally,
we would hold state in
memory (we do have enough capacity in the Production environment for it) for
performance reasons and spill over to disk (which I believe Flink should
provide out of the box?). It feels like 3x the anticipated state volume in
cluster memory should have been enough to just initialize state. So instead of
just continuing to increase memory (which may or may not help, as the error is
about GC overhead), we wanted to get some input from experts on best practices
and on how to plan this application better.
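
For reference, the back-of-the-envelope sizing behind the "3x" figure can be
written out as below. This is only arithmetic on the numbers quoted above. It
assumes state is spread evenly across slots, and it ignores JVM overhead,
network buffers, and any difference between savepoint size and in-memory size.
The variable and function names are made up for illustration.

```python
# Cluster layout from the production attempt described above.
num_tms = 10
slots_per_tm = 4
tm_memory_gb = 8

# Estimated state volume, taken from the savepoint sizes (GB).
state_estimate_gb = (25, 30)

cluster_memory_gb = num_tms * tm_memory_gb          # total TM memory
total_slots = num_tms * slots_per_tm                # parallel slots available
memory_per_slot_gb = tm_memory_gb / slots_per_tm    # naive even split per slot


def headroom(state_gb: float) -> tuple:
    """Return (cluster memory / state size, state per slot in GB)."""
    # Assumes keys, and therefore state, are distributed evenly over all slots.
    return cluster_memory_gb / state_gb, state_gb / total_slots


for state_gb in state_estimate_gb:
    ratio, per_slot = headroom(state_gb)
    print(f"state={state_gb} GB: cluster/state={ratio:.2f}x, "
          f"state per slot={per_slot:.3f} GB of {memory_per_slot_gb:.1f} GB")
```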

Appreciate your input in advance!
