Hi folks,

We created a stateful job using session windows and the RocksDB state backend, and deployed it on a Kubernetes StatefulSet with persistent volumes. We are running Flink 1.14.
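For reference, this is roughly how the job is wired up. This is a minimal sketch, not our actual pipeline; the source, key, and 30-minute session gap are placeholders:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class SessionJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // RocksDB state backend with incremental checkpoints; the local RocksDB
            // working directory ends up under the task manager's tmp dirs by default.
            env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
            env.enableCheckpointing(60_000);

            env.fromElements(Tuple2.of("user-1", 1L), Tuple2.of("user-2", 1L))
                .keyBy(t -> t.f0)
                // Session window per key, closing after 30 minutes of inactivity
                .window(ProcessingTimeSessionWindows.withGap(Time.minutes(30)))
                .sum(1)
                .print();

            env.execute("session-window-job");
        }
    }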
After the job runs for a while, the local RocksDB directory keeps growing, with more and more subdirectories accumulating inside it. It appears that when the job is restarted, or the task manager's K8s pod is restarted, the RocksDB directories belonging to the previously assigned operator instances are not cleaned up. Here is an example:

drwxr-xr-x 3 root root 4096 Jun 27 18:23 job_00000000000000000000000000000000_op_WindowOperator_2b0a50a068bb7f1c8a470e4f763cbf26__1_4__uuid_c97f3f3f-649a-467d-82af-2bc250ec6e22
drwxr-xr-x 3 root root 4096 Jun 27 18:45 job_00000000000000000000000000000000_op_WindowOperator_2b0a50a068bb7f1c8a470e4f763cbf26__1_4__uuid_e4fca2c3-74c7-4aa2-9ca1-dda866b8de11
drwxr-xr-x 3 root root 4096 Jun 27 18:56 job_00000000000000000000000000000000_op_WindowOperator_2b0a50a068bb7f1c8a470e4f763cbf26__2_4__uuid_f1f7777a-7402-494d-80d7-65861394710c
drwxr-xr-x 3 root root 4096 Jun 27 17:34 job_00000000000000000000000000000000_op_WindowOperator_f6dc7f4d2283f4605b127b9364e21148__3_4__uuid_08a14423-bea1-44ce-96ee-360a516d72a6

Only job_00000000000000000000000000000000_op_WindowOperator_2b0a50a068bb7f1c8a470e4f763cbf26__2_4__uuid_f1f7777a-7402-494d-80d7-65861394710c belongs to the currently running operator instance; the directories for the past instances still exist.

We set the task manager property taskmanager.resource-id to the pod name under the StatefulSet, but that did not help clean up the old directories. Any pointers on how to solve this?

We also checked the latest documentation, and it seems that Flink 1.15 introduced the concept of a local working directory: https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/working_directory/. Would that help with cleaning up the RocksDB directories?

Thanks,
Allen
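P.S. For completeness, here is the relevant configuration as we understand it. The 1.15 key comes from the page linked above; the values and paths shown are illustrative, not verified:

    # flink-conf.yaml
    # What we set today on 1.14: pin the resource id to the StatefulSet pod
    # name (the actual pod name is injected per pod at startup).
    taskmanager.resource-id: my-taskmanager-0

    # What we would try on 1.15, per the working-directory docs: a stable
    # working directory on the persistent volume.
    process.working-dir: /pv/flink/workdir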