Stefan, I did have the first point at the back of my mind. I was under the impression, though, that for checkpoints the cleanup would be done by the TMs, since the checkpoints are taken by the TMs. So for a standalone cluster with its own ZooKeeper for JM high availability, a NAS is a must-have? We were going to go with local checkpoints plus access to a remote HDFS for savepoints; it sounds like that will be a bad idea then. Unfortunately we can't run on YARN, and NAS is also a no-no in one of our datacenters - there is a mountain of security compliance to climb before we can use it in Production if we need to go that route.

Thanks, Ashish
On Monday, July 23, 2018, 5:10 AM, Stefan Richter <s.rich...@data-artisans.com> wrote:

Hi,

I am wondering how this can even work properly if you are using a local fs for checkpoints instead of a distributed fs. First, what happens under node failures, if the SSD becomes unavailable, if a task gets scheduled to a different machine and can no longer access the disk with the corresponding state data, or if you want to scale out? Second, the same problem is what you can observe with the job manager: how could the checkpoint coordinator, which runs on the JM, access a file on a local fs on a different node to clean up the checkpoint data? The purpose of using a distributed fs here is that all TMs and the JM can access the checkpoint files.

Best,
Stefan

> Am 22.07.2018 um 19:03 schrieb Ashish Pokharel <ashish...@yahoo.com>:
>
> All,
>
> We recently moved our checkpoint directory from HDFS to local SSDs mounted
> on the Data Nodes (we were starting to see performance impacts on
> checkpoints etc. as complex ML apps were spinning up more and more in
> YARN). This worked great, except that when jobs are canceled or canceled
> with a savepoint, the local data is not cleaned up. In HDFS, checkpoint
> directories were cleaned up on Cancel and Cancel with Savepoint as far as
> I can remember. I am wondering if it is a permissions issue. The local
> disks have RWX permissions for both the yarn and flink headless users
> (the flink headless user submits the apps to YARN via our CICD pipeline).
>
> Appreciate any pointers on this.
>
> Thanks, Ashish
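
For reference, the setup Stefan is describing (fast local disks for the state backend's working state, while checkpoints and savepoints go to storage that every TM *and* the JM can reach) can be sketched in `flink-conf.yaml` roughly like this. The paths and hostnames below are placeholders, not values from this thread:

```yaml
# Keep RocksDB's working state on the fast local SSDs of each TM.
state.backend: rocksdb
state.backend.rocksdb.localdir: /mnt/ssd/flink/rocksdb

# Checkpoints and savepoints must live on a fs that all TMs and the JM
# can access; otherwise the checkpoint coordinator on the JM cannot
# clean up (or restore from) the checkpoint data - e.g. HDFS:
state.checkpoints.dir: hdfs://namenode:8020/flink/checkpoints
state.savepoints.dir: hdfs://namenode:8020/flink/savepoints
```

This way the per-record state access stays on the local SSD, and only the checkpoint/savepoint snapshots travel to HDFS, which also lets the JM clean them up on Cancel / Cancel with Savepoint.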