Hi,
Since your state (~150 GB) seems to fit into memory (~700 GB), I would
recommend trying the HashMapStateBackend:
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/state/state_backends/#the-hashmapstatebackend
(unless you know that your state size is going to increase a lot soon).
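For reference, switching to that backend is a small configuration change. A minimal sketch for Flink 1.13 (the checkpoint directory below is a placeholder, not from the thread):

```yaml
# flink-conf.yaml — select the heap-based state backend (Flink 1.13+)
state.backend: hashmap
# checkpoints still need a durable location; replace with your own path
state.checkpoints.dir: s3://your-bucket/flink-checkpoints
```

The same choice can also be made per job via `env.setStateBackend(new HashMapStateBackend())`.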
Hi Xintong and Robert,
Thanks for the reply.
The checkpoint size for our job is 10-20 GB since we are doing incremental
checkpointing; if we do a savepoint, it can be as big as 150 GB.
1) We will try to make Flink instance bigger.
2) Thanks for the pointer, we will take a look.
3) We do have CPU
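For context (my understanding, not stated in the thread): incremental checkpoints upload only the RocksDB files that changed since the last checkpoint, while a savepoint is a full, self-contained snapshot of all state, which would explain the 10-20 GB vs. 150 GB gap. A sketch of the configuration this setup presumably uses:

```yaml
# flink-conf.yaml — RocksDB state backend with incremental checkpoints
# (assumed to match the setup described above)
state.backend: rocksdb
state.backend.incremental: true
```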
Hi Thomas,
My gut feeling is that you can use the available resources more efficiently.
What's the size of a checkpoint for your job (you can see that from the
UI)?
Given that your cluster has an aggregate of 64 * 12 = 768 GB of memory
available, you might be able to do everything in memory (
Hi Thomas,
It would be helpful if you can provide the jobmanager/taskmanager logs, and
gc logs if possible.
Additionally, you may consider monitoring the CPU/memory-related metrics
[1], and see if there's anything abnormal when the problem is observed.
Thank you~
Xintong Song
[1]
https://ci.apach
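As a sketch of what enabling a metrics reporter and GC logging could look like (the reporter choice, JDK 8 flag syntax, and log path are illustrative assumptions, not from the thread):

```yaml
# flink-conf.yaml — expose metrics via JMX and enable GC logging (sketch)
metrics.reporter.jmx.factory.class: org.apache.flink.metrics.jmx.JMXReporterFactory
# GC logging flags for the TaskManager JVM (JDK 8 flag syntax; adjust for newer JDKs)
env.java.opts.taskmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/gc.log
```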
Hi,
I'm trying to see if we have allocated enough resources (i.e. CPU and
memory) to each task node to perform a deduplication job. Currently, the
job is not running very stably. What I have been observing is that after a
couple of days of running, we will suddenly see backpressure happen on one
arbitrary