Hi Folks,

We created a stateful job using SessionWindow and RocksDB state backend and
deployed it on Kubernetes Statefulset with persisted volumes. The Flink
version we used is 1.14.

After the job had been running for a while, we observed that the local
RocksDB directory kept growing, with more and more subdirectories
accumulating inside it. It appears that when the job is restarted or
the task manager K8s pod is restarted, the RocksDB directory
corresponding to the previously assigned operator is not cleaned up.
Here is an example:

drwxr-xr-x 3 root root 4096 Jun 27 18:23
job_00000000000000000000000000000000_op_WindowOperator_2b0a50a068bb7f1c8a470e4f763cbf26__1_4__uuid_c97f3f3f-649a-467d-82af-2bc250ec6e22
drwxr-xr-x 3 root root 4096 Jun 27 18:45
job_00000000000000000000000000000000_op_WindowOperator_2b0a50a068bb7f1c8a470e4f763cbf26__1_4__uuid_e4fca2c3-74c7-4aa2-9ca1-dda866b8de11
drwxr-xr-x 3 root root 4096 Jun 27 18:56
job_00000000000000000000000000000000_op_WindowOperator_2b0a50a068bb7f1c8a470e4f763cbf26__2_4__uuid_f1f7777a-7402-494d-80d7-65861394710c
drwxr-xr-x 3 root root 4096 Jun 27 17:34
job_00000000000000000000000000000000_op_WindowOperator_f6dc7f4d2283f4605b127b9364e21148__3_4__uuid_08a14423-bea1-44ce-96ee-360a516d72a6

Although only
job_00000000000000000000000000000000_op_WindowOperator_2b0a50a068bb7f1c8a470e4f763cbf26__2_4__uuid_f1f7777a-7402-494d-80d7-65861394710c
belongs to the currently active operator instance, the directories
from the previous operator instances still exist on disk.

We set the task manager property taskmanager.resource-id to the pod name
of the task manager under the StatefulSet, but that did not seem to help
with cleaning up the previous directories.
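For reference, we wire the pod name into that property via the Kubernetes
downward API, roughly like this (simplified from our manifest; the volume
and path names are illustrative):

```yaml
# Task manager StatefulSet pod template (simplified)
spec:
  containers:
    - name: taskmanager
      env:
        # Expose the pod's own name to the container
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      args:
        # Make the Flink resource id deterministic across restarts
        - "-Dtaskmanager.resource-id=$(POD_NAME)"
```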

Any pointers to solve this issue?

We checked the latest documentation, and it seems that Flink 1.15
introduced the concept of a local working directory:
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/working_directory/.
Would that help clean up the stale RocksDB directories?
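If it would, I assume the relevant configuration is something like the
following (based on my reading of the linked page; the path is just an
example, please correct me if I misread it):

```yaml
# flink-conf.yaml (Flink 1.15+)
# Shared default for all processes:
process.working-dir: /pvc/flink-workdir
# Or scoped per process type:
# process.jobmanager.working-dir: /pvc/flink-jm-workdir
# process.taskmanager.working-dir: /pvc/flink-tm-workdir
```

My understanding is that this only works for recovery if
taskmanager.resource-id stays stable across restarts, which is why we
pinned it to the StatefulSet pod name.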

Thanks,
Allen
