Re: AKA and quarantine

2018-01-30 Thread Till Rohrmann
If you don't run Flink in standalone mode, then you can activate taskmanager.exit-on-fatal-akka-error. However, keep in mind that at some point you might run out of spare TMs to run your jobs unless you restart them manually. Cheers, Till On Mon, Jan 29, 2018 at 6:41 PM, Vishal Santoshi wrote:
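For reference, the option Till mentions above could be set as in the following minimal sketch. It assumes the setting goes into the cluster's flink-conf.yaml and that something outside Flink (a resource manager such as Yarn, or a hypothetical process supervisor) brings the TaskManager back after it exits; neither detail is spelled out in the thread.

```yaml
# flink-conf.yaml (sketch, not a verified production setting)
# Make a TaskManager terminate itself on a fatal Akka error such as being quarantined.
# Assumption: an external mechanism restarts the process afterwards; without one,
# the pool of spare TMs shrinks until they are restarted manually.
taskmanager.exit-on-fatal-akka-error: true
```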

Re: AKA and quarantine

2018-01-29 Thread Vishal Santoshi
>> If you enable taskmanager.exit-on-fatal-akka-error, then it will stop TMs which got quarantined. This will automatically restart TMs in case that you are running Flink on Yarn. Thus, I would recommend enabling this if possible
We do not use Yarn. This would end up restarting the jobs on the rem

Re: AKA and quarantine

2018-01-29 Thread Till Rohrmann
Hi Vishal, Akka usually quarantines remote ActorSystems in case of a system message delivery failure or if the death watch was triggered. This can happen, for example, if your machine is under heavy load or high GC pressure and does not find enough time to respond to the heartbeats. - If yo

Re: AKA and quarantine

2018-01-29 Thread Vishal Santoshi
Thank you.
On Mon, Jan 29, 2018 at 3:17 AM, Fabian Hueske wrote:
> Hi Vishal,
> sorry for the late response.
> Till (in CC) might be able to answer your Akka / coordination related questions.
> Best, Fabian
> 2018-01-24 1:22 GMT+01:00 Vishal Santoshi :
>> Any suggestions ? I know thes

Re: AKA and quarantine

2018-01-29 Thread Fabian Hueske
Hi Vishal,
sorry for the late response. Till (in CC) might be able to answer your Akka / coordination related questions.
Best, Fabian
2018-01-24 1:22 GMT+01:00 Vishal Santoshi :
> Any suggestions ? I know these are very general issue but these are edge conditions that we want the community t

Re: AKA and quarantine

2018-01-23 Thread Vishal Santoshi
Any suggestions? I know these are very general issues, but these are edge conditions that we want the community to give us general advice on.
On Sun, Jan 21, 2018 at 3:16 PM, Vishal Santoshi wrote:
> There have been a couple of instances where one of our TMs was quarantined ( the cause is ir

AKA and quarantine

2018-01-21 Thread Vishal Santoshi
There have been a couple of instances where one of our TMs was quarantined (the cause is irrelevant to this discussion), and we had to bounce the TM to bring back sanity to the cluster. There have been discussions around this and I am trying to distill them. My questions are: * Based on https://issu