Re: Permissions to delete Checkpoint on cancel

2018-07-23 Thread ashish pok
Stefan, I can’t thank you enough for this write-up. This is an awesome explanation. I had misunderstood the concepts of the RocksDB working directory and the checkpoint FS. My main intent is to boost RocksDB performance with the locally available SSDs. Recovery time from HDFS is not much of a concern, but the load on…

Re: Permissions to delete Checkpoint on cancel

2018-07-23 Thread Stefan Richter
Hi, ok, let me briefly explain the differences between the local working directory, the checkpoint directory, and the savepoint directory, and also outline their best practices/requirements/tradeoffs. A first easy comment is that checkpoints and savepoints typically have similar requirements, and most users wri…

Re: Permissions to delete Checkpoint on cancel

2018-07-23 Thread ashish pok
Sorry, just a follow-up. In the absence of a NAS, the best option to go with here is checkpoints and savepoints both on HDFS, with the state backend using the local SSDs, then? We were trying not to hit HDFS at all other than for savepoints. - Ashish On Monday, July 23, 2018, 7:45 AM, ashish pok wrote: Stefan…
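The setup Ashish converges on here (checkpoints and savepoints on HDFS, RocksDB working directory on the local SSDs) can be sketched as a `flink-conf.yaml` fragment. This is a minimal sketch; the HDFS and SSD mount paths are illustrative, not taken from the thread:

```yaml
# Checkpoints and savepoints go to durable, cluster-wide storage (must be a
# distributed fs so any JM/TM can reach them after failover or cancel).
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints
state.savepoints.dir: hdfs:///flink/savepoints

# RocksDB's local working directory: fast SSD storage, purely transient.
# It only holds the live state of tasks currently running on this node,
# so losing it on node failure is fine (recovery reads from HDFS).
state.backend.rocksdb.localdir: /mnt/ssd/flink/rocksdb

# Optional: incremental checkpoints reduce the per-checkpoint HDFS write
# load that motivated moving checkpoints off HDFS in the first place.
state.backend.incremental: true
```

With this split, the SSDs serve reads/writes on the hot path while only checkpoint uploads touch HDFS, which matches the performance goal stated earlier in the thread.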

Re: Permissions to delete Checkpoint on cancel

2018-07-23 Thread ashish pok
Stefan, I did have the first point at the back of my mind. I was under the impression, though, that for checkpoints the cleanup would be done by the TMs, as the checkpoints are taken by the TMs. So for a standalone cluster with its own ZooKeeper for JM high availability, a NAS is a must-have? We were going to go with local…

Re: Permissions to delete Checkpoint on cancel

2018-07-23 Thread Stefan Richter
Hi, I am wondering how this can even work properly if you are using a local fs for checkpoints instead of a distributed fs. First, what happens under node failures, if the SSD becomes unavailable, or if a task gets scheduled to a different machine and can no longer access the disk with the cor…

Permissions to delete Checkpoint on cancel

2018-07-22 Thread Ashish Pokharel
All, we recently moved our checkpoint directory from HDFS to local SSDs mounted on the Data Nodes (we were starting to see perf impacts on checkpoints etc. as more and more complex ML apps were spinning up in YARN). This worked great, other than the fact that when jobs are being canceled or canceled…