Stefan, I did have the first point at the back of my mind. I was under the impression, though, that for checkpoints the cleanup would be done by the TMs, since the checkpoints are taken by the TMs. So for a standalone cluster with its own ZooKeeper for JM high availability, a NAS is a must-have? We were going to go with local checkpoints plus access to a remote HDFS for savepoints; it sounds like that will be a bad idea then. Unfortunately we can't run on YARN, and NAS is also a no-no in one of our datacenters - there is a mountain of security compliance to climb before we can use it in Production if we need to go that route.

Thanks, Ashish
On Monday, July 23, 2018, 5:10 AM, Stefan Richter <s.rich...@data-artisans.com> wrote:

Hi,

I am wondering how this can even work properly if you are using a local fs for checkpoints instead of a distributed fs. First, what happens under node failures, if the SSD becomes unavailable, if a task gets scheduled to a different machine and can no longer access the disk with the corresponding state data, or if you want to scale out? Second, the same problem is what you can observe with the job manager: how could the checkpoint coordinator, which runs on the JM, access a file on a local fs on a different node to clean up the checkpoint data? The purpose of using a distributed fs here is that all TMs and the JM can access the checkpoint files.

Best,
Stefan

> Am 22.07.2018 um 19:03 schrieb Ashish Pokharel <ashish...@yahoo.com>:
>
> All,
>
> We recently moved our checkpoint directory from HDFS to local SSDs mounted
> on the Data Nodes (we were starting to see performance impacts on
> checkpoints etc. as complex ML apps were spinning up more and more in
> YARN). This worked great, except that when jobs are canceled or canceled
> with a savepoint, the local data is not cleaned up. In HDFS, checkpoint
> directories were cleaned up on Cancel and Cancel with Savepoint as far as
> I can remember. I am wondering if it is a permissions issue. The local
> disks have RWX permissions for both the yarn and flink headless users
> (the flink headless user submits the apps to YARN via our CICD pipeline).
>
> Appreciate any pointers on this.
>
> Thanks, Ashish
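
For reference, the setup Stefan is describing (fast local disks for the state backend's working state, while checkpoints and savepoints go to storage that every TM *and* the JM can reach) can be sketched in `flink-conf.yaml` roughly like this. The paths and hostnames below are placeholders, not values from this thread:

```yaml
# Keep RocksDB's working state on the fast local SSDs of each TM.
state.backend: rocksdb
state.backend.rocksdb.localdir: /mnt/ssd/flink/rocksdb

# Checkpoints and savepoints must live on a fs that all TMs and the JM
# can access; otherwise the checkpoint coordinator on the JM cannot
# clean up (or restore from) the checkpoint data - e.g. HDFS:
state.checkpoints.dir: hdfs://namenode:8020/flink/checkpoints
state.savepoints.dir: hdfs://namenode:8020/flink/savepoints
```

This way the per-record state access stays on the local SSD, and only the checkpoint/savepoint snapshots travel to HDFS, which also lets the JM clean them up on Cancel / Cancel with Savepoint.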