Hi,
We have enabled HA as suggested. The task manager now tries to reach the job
manager via pod IP as expected, but it is still unable to connect to the
job manager:

2022-06-19 22:14:45,101 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://flink@192.168.3.144:6123/user/rpc/resourcemanager_0(8a98fdb734615089485c685afb0f402d).

2022-06-19 22:14:45,242 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [/192.168.3.144:6123] failed with java.io.IOException: Connection reset by peer

2022-06-19 22:14:45,249 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@192.168.3.144:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@192.168.3.144:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]

2022-06-19 22:14:45,255 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.3.144:6123/user/rpc/resourcemanager_0, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.3.144:6123/user/rpc/resourcemanager_0.

2022-06-


Are there any additional definitions required for that?


thanks

Sigalit

On Thu, Jun 16, 2022 at 2:28 PM Yang Wang <danrtsey...@gmail.com> wrote:

> Could you please try with high availability enabled [1]?
>
> If HA is enabled, the internal jobmanager rpc service will not be created.
> Instead, the TaskManager retrieves the JobManager address via the HA services
> and connects to it via pod IP.
>
> [1].
> https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml
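>
> For illustration only, the HA-related part of a FlinkDeployment might look
> roughly like the sketch below. The two keys are standard Flink Kubernetes HA
> options; the storage path is an assumed placeholder and has to point to
> storage that the JobManager pods can share. The linked example remains the
> reference for a complete manifest.
>
> ```yaml
> # Sketch only: HA settings inside a FlinkDeployment spec.
> spec:
>   flinkConfiguration:
>     # Use the Kubernetes-based HA services instead of the internal rpc service.
>     high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
>     # Assumed placeholder path; replace with your own shared storage location.
>     high-availability.storageDir: file:///flink-data/ha
> ```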
>
>
> Best,
> Yang
>
> Elisha, Moshe (Nokia - IL/Kfar Sava) <moshe.eli...@nokia.com>
> wrote on Thu, Jun 16, 2022 at 15:24:
>
>> Hello,
>>
>> We are launching Flink deployments using the Flink Kubernetes Operator
>> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/>
>> on a Kubernetes cluster with Istio and mTLS enabled.
>>
>> We found that the TaskManager is unable to communicate with the
>> JobManager on the jobmanager-rpc port:
>>
>> 2022-06-15 15:25:40,508 WARN  akka.remote.ReliableDeliverySupervisor [] -
>> Association with remote system
>> [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]]
>> Caused by: [The remote system explicitly disassociated (reason unknown).]
>>
>> The reason for the issue is that the JobManager service port definitions are
>> not following the Istio guidelines
>> https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/
>> (see example below).
>>
>> We believe a change to the default port definitions is needed but for
>> now, is there an immediate action we can take to work around the issue?
>> Perhaps overriding the default port definitions somehow?
>>
>> Thanks.
>>
>> flink-kubernetes-operator 1.0.0
>> Flink 1.14-java11
>> Kubernetes v1.19.5
>> Istio 1.7.6
>>
>> # k get service inference-results-to-analytics-engine -o yaml
>> apiVersion: v1
>> kind: Service
>> metadata:
>> ...
>>   labels:
>>     app: inference-results-to-analytics-engine
>>     type: flink-native-kubernetes
>>   name: inference-results-to-analytics-engine
>> spec:
>>   clusterIP: None
>>   ports:
>>   - name: jobmanager-rpc # should start with "tcp-" or add "appProtocol" property
>>     port: 6123
>>     protocol: TCP
>>     targetPort: 6123
>>   - name: blobserver # should start with "tcp-" or add "appProtocol" property
>>     port: 6124
>>     protocol: TCP
>>     targetPort: 6124
>>   selector:
>>     app: inference-results-to-analytics-engine
>>     component: jobmanager
>>     type: flink-native-kubernetes
>>   sessionAffinity: None
>>   type: ClusterIP
>> status:
>>   loadBalancer: {}
>>
>
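
For reference, a minimal sketch of port definitions that would satisfy the
Istio protocol-selection convention mentioned in the quoted message, using
either a "tcp-" name prefix or the appProtocol field. The renamed port
tcp-jobmanager-rpc is hypothetical. Because the service is created by Flink's
native Kubernetes integration, this only illustrates the convention; it is
not a confirmed workaround.

```yaml
# Sketch only: the two options for declaring the protocol explicitly.
ports:
- name: tcp-jobmanager-rpc   # option 1: name prefixed with "tcp-" (hypothetical new name)
  port: 6123
  protocol: TCP
  targetPort: 6123
- name: blobserver           # option 2: keep the name, declare the protocol via appProtocol
  appProtocol: tcp
  port: 6124
  protocol: TCP
  targetPort: 6124
```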
