All,

We recently moved our checkpoint directory from HDFS to local SSDs mounted on the DataNodes (we were starting to see performance impacts on checkpoints, etc., as more and more complex ML apps were spinning up in YARN). This has worked great, except that when jobs are canceled, or canceled with a savepoint, the local checkpoint data is not cleaned up. As far as I can remember, with HDFS the checkpoint directories were cleaned up on both cancel and cancel with savepoint.

I am wondering if it is a permissions issue. The local disks have RWX permissions for both the yarn and flink headless users (the flink headless user submits the apps to YARN through our CI/CD pipeline).
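One way to test the permissions theory would be to look at who owns the leftover directories on each node. Below is a minimal sketch of such a check, not something taken from our setup: the path `/mnt/ssd/flink/checkpoints` and the helper name `describe_dirs` are hypothetical placeholders.

```python
import os
import pwd
import stat


def describe_dirs(root):
    """Return (path, owner, mode) for each immediate subdirectory of root."""
    rows = []
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if not entry.is_dir(follow_symlinks=False):
            continue  # only directories are of interest here
        info = entry.stat(follow_symlinks=False)
        try:
            owner = pwd.getpwuid(info.st_uid).pw_name
        except KeyError:
            owner = str(info.st_uid)  # uid without a passwd entry
        rows.append((entry.path, owner, stat.filemode(info.st_mode)))
    return rows


if __name__ == "__main__":
    # Hypothetical mount point; substitute the real checkpoint directory.
    root = "/mnt/ssd/flink/checkpoints"
    if os.path.isdir(root):
        for path, owner, mode in describe_dirs(root):
            print(f"{mode} {owner:<10} {path}")
```

If the leftover directories turn out to be owned by a user other than the one the job runs as, that would point toward permissions; if ownership and modes look right, the cause is probably elsewhere.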
Appreciate any pointers on this.

Thanks,
Ashish