Hi, everyone:
I'm a Flink SQL user on version 1.8.2.
Recently I have been confused about memory and backpressure. I have two jobs on
YARN, and both are frequently killed by YARN for exceeding their memory limit.
The first job has 3 TaskManagers with 8 GB of memory each and a parallelism of
6. It reads from Kafka and uses one-minute tumbling windows to calculate PV and
UV. There are many aggregation dimensions, so to avoid data skew it groups by
deviceId, TUMBLE(event_time, INTERVAL '1' MINUTE). My question is: the
checkpoint is only 60 MB and I give the job 24 GB of memory in total, so why is
it killed by YARN? I use RocksDB as the state backend and the data volume is
large, but I would expect that to cause backpressure rather than an OOM, and no
backpressure is triggered. inPoolUsage is normally around 0.45.
The second job looks different. It uses 2 TaskManagers with 20 GB of memory
each and a parallelism of 4. I define an aggregate function to calculate
complex data, grouped by date, hour, deviceId. It behaves like the first job:
OOM and no backpressure. The difference is that when I read one day of data,
only one TaskManager is killed by YARN, which confuses me. According to the
dashboard there is no data skew, so why is only one TaskManager killed?
These may or may not be the same problem, but I would like to know more about
how Flink uses memory, whether backpressure can stop the source, how
backpressure is triggered, and how RocksDB affects Flink.
Thanks for reading. Any suggestions would be appreciated. Thank you.