Hi Abdul,
in Flink 1.4 we use Akka's death watch to detect hosts that are no longer
reachable. The downside of the death watch mechanism is that hosts which were
detected to be dead are quarantined. Once in this state, you need to restart
the ActorSystem in order to receive messages again. The idea be
We were able to fix it by passing the IP address instead of the hostname as
the actor system listen address when starting the taskmanager:
def runTaskManager(
    taskManagerHostname: String,
    resourceID: ResourceID,
    actorSystemPort: Int,
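A minimal sketch of the workaround described above, assuming the hostname is resolved to an IP address before it is handed to the taskmanager startup code. `toIpAddress` and `ResolveListenAddress` are hypothetical names for illustration and not part of Flink; the only library call used is `java.net.InetAddress`.

```scala
import java.net.InetAddress

object ResolveListenAddress {
  // Hypothetical helper (not a Flink API): resolve a hostname to its IP string.
  // An IPv4 literal is returned unchanged, without a DNS lookup.
  // Throws java.net.UnknownHostException if the name cannot be resolved.
  def toIpAddress(hostnameOrIp: String): String =
    InetAddress.getByName(hostnameOrIp).getHostAddress

  def main(args: Array[String]): Unit = {
    val host = args.headOption.getOrElse("localhost")
    // The resulting string is what would be passed as the listen address
    // (assumption: in place of the hostname argument shown above).
    println(s"$host -> ${toIpAddress(host)}")
  }
}
```

Whether this helps depends on the setup: it only avoids hostname-based addressing, which can matter when a restarted pod comes back under the same name with a different IP.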
It is hard to tell without all the logs, but it could easily be a K8s setup
problem. Another problem is that you are running a Flink version which is
no longer actively supported. At least try to use the latest bug fix
release for 1.4.
Cheers,
Till
On Fri, Oct 12, 2018, 09:43 Abdul Qadeer wrote:
Hi Till,
A few more data points:
In a rerun of the same versions with a fresh deployment, I see
log.debug(s"RegisterTaskManager: $msg") in JobManager; however, the
AcknowledgeRegistration/AlreadyRegistered messages are never sent. I have
taken tcpdump for the taskmanager which doesn't recove
Hi Till,
I didn't try with newer versions as it is not possible to update the Flink
version atm.
If you could give any pointers for debugging that would be great.
On Thu, Oct 11, 2018 at 2:44 AM Till Rohrmann wrote:
Hi Abdul,
have you tried whether this problem also occurs with newer Flink versions
(1.5.4 or 1.6.1)?
Cheers,
Till
On Thu, Oct 11, 2018 at 9:24 AM Dawid Wysakowicz
wrote:
Hi Abdul,
I've added Till and Gary to cc, who might be able to help you.
Best,
Dawid
On 11/10/18 03:05, Abdul Qadeer wrote:
Hi,
We are facing an issue in standalone HA mode in Flink 1.4.0 where
Taskmanager restarts and is not able to register with the Jobmanager. It
times out awaiting *AcknowledgeRegistration/AlreadyRegistered* message from
Jobmanager Actor and keeps sending *RegisterTaskManager *message. The logs
at