Hi everyone,
 
I am currently dealing with an OOM issue caused by continuous queries in Flink 
SQL (for example, a GROUP BY clause without a window). After looking into 
Flink's Runtime & Core modules, I found that the StateTables in the 
HeapKeyedStateBackend grow larger and larger over time, because too many 
distinct keys have to be kept in memory (and GC ends up taking most of the CPU 
time, often pausing for 20 seconds or more), eventually crashing the job or even 
the entire JobManager.
 
Therefore, I am now thinking of implementing a "spillable" StateTable that 
could spill keys together with their values to disk (or to a state backend like 
RocksDB) when heap memory is about to be exhausted, and move the entries back 
once memory pressure eases. In contrast to using RocksDB alone, Flink's 
throughput would then not be affected as long as no "spill" is needed (i.e., 
while heap memory is sufficient).
 
I am not sure whether the Flink community already has plans for dealing with the 
OOM issues caused by a large number of keys and states on the heap (other than 
using RocksDB alone, which sacrifices throughput even when heap memory would be 
sufficient), or whether my plan has serious flaws that make it not worth doing, 
such as making the checkpointing process harder to keep consistent, or greatly 
reducing throughput because of the additional lookup cost.

I would appreciate any feedback, and thank you in advance, especially Fabian, 
who recommended this mailing list to me : )
 
Sincerely,
Kyle
