Hi Vishal,

Sorry for the late response. Till (in CC) might be able to answer your Akka and coordination-related questions.

Best,
Fabian

2018-01-24 1:22 GMT+01:00 Vishal Santoshi <vishal.santo...@gmail.com>:

> Any suggestions? I know these are very general issues, but these are edge
> conditions that we want the community to give us general advice on.
>
> On Sun, Jan 21, 2018 at 3:16 PM, Vishal Santoshi <
> vishal.santo...@gmail.com> wrote:
>
>> There have been a couple of instances where one of our TMs was
>> quarantined (the cause is irrelevant to this discussion), and we had to
>> bounce the TM to bring back sanity to the cluster. There have been
>> discussions around this and I am trying to distill them. My questions are:
>>
>> * Based on https://issues.apache.org/jira/browse/FLINK-3347, is it
>> advisable to set taskmanager.exit-on-fatal-akka-error to true?
>>
>> * Is akka.ask.timeout relevant here? We could increase the value to
>> greater than 10s, but based on your experience is that more of a "mask the
>> issue" exercise, or is 10s generally a low value that *should* be
>> increased?
>>
>> * Is it possible, or is there some effort being put into, tracking per-job
>> memory/resource consumption for a multi-job setup, which is very normal
>> with Flink?
>>
>> * Is there an effort to monitor RocksDB usage (off-heap and so on)?
>> It seems a black box to a user as of today.
>>
>> Thank you and regards.