All,

We recently moved our checkpoint directory from HDFS to local SSDs mounted on 
the Data Nodes (we were starting to see performance impacts on checkpoints, 
etc., as more and more complex ML apps were spinning up in YARN). This worked 
great, except that when jobs are canceled or canceled with a savepoint, the 
local data is not cleaned up. On HDFS, checkpoint directories were cleaned up 
on Cancel and Cancel with Savepoint, as far as I can remember. I am wondering 
if it is a permissions issue. The local disks have RWX permissions for both the 
yarn and flink headless users (the flink headless user submits the apps to YARN 
using our CI/CD pipeline). 

Appreciate any pointers on this.

Thanks, Ashish
