Hi, everyone:
I am a Flink SQL user on version 1.8.2. Recently I have been confused about memory usage and backpressure. I have two jobs on YARN, and both are frequently killed by YARN for exceeding their memory limits.

The first job runs 3 TaskManagers with parallelism 6, and each TaskManager has 8 GB of memory. It reads from Kafka and uses a one-minute tumbling window to calculate pv and uv. There are many aggregation dimensions, and to avoid data skew it groups by deviceId, TUMBLE(event_time, INTERVAL '1' MINUTE). My question: the checkpoint is only about 60 MB and the job has 24 GB of memory in total, so why is it killed by YARN? I use RocksDB as the state backend, and the data volume is large, but I would expect that to trigger backpressure rather than run out of memory. Backpressure is never triggered, and the In Pool Usage metric stays around 0.45 under normal conditions.

The second job looks different. It runs 2 TaskManagers with parallelism 4, and each TaskManager has 20 GB of memory. I define a user-defined aggregate function to compute some complex metrics, grouped by date, hour, deviceId. It behaves like the first job: killed for memory with no backpressure. What puzzles me is that when I read one full day of data, only one of the TaskManagers is killed by YARN. According to the dashboard there is no data skew, so why only one TaskManager?

Maybe these are the same problem, maybe not, but I would like to understand more about how Flink uses memory, whether backpressure can slow down or stop the source and how it is triggered, and how RocksDB affects Flink's memory footprint.

Thanks for reading; any suggestions would be much appreciated. Thank you.
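
For reference, the first job's query looks roughly like this. This is only a simplified sketch: the source table name, the userId column used for uv, and the pv/uv expressions are placeholders, not the exact production SQL.

    SELECT
      deviceId,
      TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
      COUNT(*) AS pv,
      COUNT(DISTINCT userId) AS uv
    FROM kafka_source
    GROUP BY
      deviceId,
      TUMBLE(event_time, INTERVAL '1' MINUTE)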
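
The second job is roughly like this, again simplified: complexAgg stands in for my user-defined AggregateFunction, and the table and column names are placeholders.

    SELECT
      `date`,
      `hour`,
      deviceId,
      complexAgg(some_fields) AS result
    FROM kafka_source
    GROUP BY `date`, `hour`, deviceId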