Hi,

Some more test results from the task manager and job manager pods:

From the task manager, I cannot connect to the job manager:

root@basic-example-taskmanager-5d54f9f94-rbcr4:/opt/flink# wget basic-example.default:6123
--2024-01-12 15:16:15-- http://basic-example.default:6123/
Resolving basic-example.default (basic-example.default)... 100.64.3.182
Connecting to basic-example.default (basic-example.default)|100.64.3.182|:6123... ^C

From the job manager, I can (DNS is OK, the same IP is returned):

root@basic-example-57774f887d-6bht8:/opt/flink# wget basic-example.default:6123
--2024-01-12 15:16:25-- http://basic-example.default:6123/
Resolving basic-example.default (basic-example.default)... 100.64.3.182
Connecting to basic-example.default (basic-example.default)|100.64.3.182|:6123... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

The services are created, however:

basic-example        ClusterIP   None             <none>   6123/TCP,6124/TCP   2s
basic-example-rest   ClusterIP   100.87.240.180   <none>   8081/TCP            2s

Maybe the job manager only listens on localhost instead of 0.0.0.0 or its real IP? Is that something I can control?

Thanks,
Arnaud

From: LINZ, Arnaud
Sent: Friday, January 12, 2024 2:07 PM
To: user@flink.apache.org
Subject: FW: Deploying the K8S operator sample on GKE Autopilot : Association with remote system [akka.tcp://flink@basic-example.default:6123] has failed,

Hello,

I am trying to follow the “quickstart” guide on a GKE Autopilot k8s cluster.
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/try-flink-kubernetes-operator/quick-start/

I could install the operator (without webhook) without issue; however, when running

kubectl create -f https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.7/examples/basic.yaml

the job does not work because the task manager cannot reach the job manager (maybe a DNS issue?).
Is there some special DNS/network configuration to perform in GKE? Has anybody already made it work?

Thanks,
Arnaud

The job manager log is:

2024-01-12 11:01:56,878 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source (1/2) (c2bf83a958eaf6701eb2eebbfadc8e2c_bc764cd8ddf7a0cff126f51c16239658_0_2) switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source (2/2) (c2bf83a958eaf6701eb2eebbfadc8e2c_bc764cd8ddf7a0cff126f51c16239658_1_2) switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map -> Sink: Print to Std. Out (1/2) (c2bf83a958eaf6701eb2eebbfadc8e2c_20ba6b65f97481d5570070de90e4e791_0_2) switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map -> Sink: Print to Std. Out (2/2) (c2bf83a958eaf6701eb2eebbfadc8e2c_20ba6b65f97481d5570070de90e4e791_1_2) switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,879 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job 096668d0039ed54215ae334b5d89aa82: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=1}]
2024-01-12 11:01:56,880 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job 096668d0039ed54215ae334b5d89aa82: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=2}]
2024-01-12 11:01:56,902 INFO org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger checkpoint for job 096668d0039ed54215ae334b5d89aa82 since Checkpoint triggering task Source: Custom Source (1/2) of job 096668d0039ed54215ae334b5d89aa82 is not being executed at the moment. Aborting checkpoint. Failure reason: Not all required tasks are currently running..
2024-01-12 11:01:57,014 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - need request 1 new workers, current worker number 0, declared worker number 1
2024-01-12 11:01:57,015 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=1.0, taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), managedMemSize=634.880mb (665719939 bytes), numSlots=2}, current pending count: 1.
2024-01-12 11:01:57,016 INFO org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
2024-01-12 11:01:57,018 INFO org.apache.flink.configuration.Configuration [] - Config uses fallback configuration key 'kubernetes.service-account' instead of key 'kubernetes.taskmanager.service-account'
2024-01-12 11:01:57,022 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating new TaskManager pod with name basic-example-taskmanager-1-3 and resource <2048,1.0>.
2024-01-12 11:01:57,095 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Pod basic-example-taskmanager-1-3 is created.
2024-01-12 11:01:57,116 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Received new TaskManager pod: basic-example-taskmanager-1-3
2024-01-12 11:01:57,117 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker basic-example-taskmanager-1-3 with resource spec WorkerResourceSpec {cpuCores=1.0, taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), managedMemSize=634.880mb (665719939 bytes), numSlots=2}.
2024-01-12 11:01:58,902 INFO org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger checkpoint for job 096668d0039ed54215ae334b5d89aa82 since Checkpoint triggering task Source: Custom Source (1/2) of job 096668d0039ed54215ae334b5d89aa82 is not being executed at the moment. Aborting checkpoint. Failure reason: Not all required tasks are currently running..
(…)

The task manager log is:

(…)
2024-01-12 11:02:02,229 INFO org.apache.flink.core.plugin.DefaultPluginManager [] - Plugin loader with ID found, reusing it: metrics-jmx
2024-01-12 11:02:02,232 INFO org.apache.flink.runtime.state.changelog.StateChangelogStorageLoader [] - StateChangelogStorageLoader initialized with shortcut names {memory,filesystem}.
2024-01-12 11:02:02,252 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory [] - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2024-01-12 11:02:02,325 INFO org.apache.flink.runtime.security.modules.JaasModule [] - Jaas file will be created as /tmp/jaas-3174943888264039421.conf.
2024-01-12 11:02:02,334 INFO org.apache.flink.runtime.security.contexts.HadoopSecurityContextFactory [] - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2024-01-12 11:02:02,929 INFO org.apache.flink.configuration.Configuration [] - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
2024-01-12 11:02:02,939 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils [] - Trying to select the network interface and address to use by connecting to the leading JobManager.
2024-01-12 11:02:02,940 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils [] - TaskManager will try to connect for PT10S before falling back to heuristics
2024-01-12 11:02:05,826 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Trying to connect to address basic-example.default/100.64.3.37:6123
2024-01-12 11:02:06,027 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [basic-example-taskmanager-1-3/100.64.3.40] with timeout [200] due to: connect timed out
2024-01-12 11:02:06,079 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:06,131 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:06,182 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/127.0.0.1] with timeout [50] due to: connect timed out
2024-01-12 11:02:07,185 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/100.64.3.40] with timeout [1000] due to: connect timed out
2024-01-12 11:02:08,187 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/127.0.0.1] with timeout [1000] due to: connect timed out
2024-01-12 11:02:08,287 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Trying to connect to address basic-example.default/100.64.3.37:6123
2024-01-12 11:02:08,489 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [basic-example-taskmanager-1-3/100.64.3.40] with timeout [200] due to: connect timed out
2024-01-12 11:02:08,541 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:08,592 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:08,643 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/127.0.0.1] with timeout [50] due to: connect timed out
2024-01-12 11:02:09,645 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/100.64.3.40] with timeout [1000] due to: connect timed out
2024-01-12 11:02:10,648 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/127.0.0.1] with timeout [1000] due to: connect timed out
2024-01-12 11:02:10,849 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Trying to connect to address basic-example.default/100.64.3.37:6123
2024-01-12 11:02:11,051 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [basic-example-taskmanager-1-3/100.64.3.40] with timeout [200] due to: connect timed out
2024-01-12 11:02:11,103 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:11,155 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:11,205 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/127.0.0.1] with timeout [50] due to: connect timed out
2024-01-12 11:02:12,208 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/100.64.3.40] with timeout [1000] due to: connect timed out
2024-01-12 11:02:13,210 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect to [basic-example.default/100.64.3.37:6123] from local address [/127.0.0.1] with timeout [1000] due to: connect timed out
2024-01-12 11:02:13,211 WARN org.apache.flink.runtime.net.ConnectionUtils [] - Could not connect to basic-example.default/100.64.3.37:6123. Selecting a local address using heuristics.
2024-01-12 11:02:13,212 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - TaskManager will use hostname/address 'basic-example-taskmanager-1-3' (100.64.3.40) for communication.
2024-01-12 11:02:13,331 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Trying to start actor system, external address 100.64.3.40:6122, bind address 0.0.0.0:6122.
2024-01-12 11:02:14,832 INFO akka.event.slf4j.Slf4jLogger [] - Slf4jLogger started
2024-01-12 11:02:14,927 INFO akka.remote.RemoteActorRefProvider [] - Akka Cluster not in use - enabling unsafe features anyway because `akka.remote.use-unsafe-remote-features-outside-cluster` has been enabled.
2024-01-12 11:02:14,928 INFO akka.remote.Remoting [] - Starting remoting
2024-01-12 11:02:15,252 INFO akka.remote.Remoting [] - Remoting started; listening on addresses :[akka.tcp://flink@100.64.3.40:6122]
2024-01-12 11:02:15,642 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Actor system started at akka.tcp://flink@100.64.3.40:6122
2024-01-12 11:02:15,738 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Using working directory: WorkingDirectory(/tmp/tm_basic-example-taskmanager-1-3)
2024-01-12 11:02:15,826 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl [] - No metrics reporter configured, no metrics will be exposed/reported.
2024-01-12 11:02:15,832 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Trying to start actor system, external address 100.64.3.40:0, bind address 0.0.0.0:0.
2024-01-12 11:02:15,928 INFO akka.event.slf4j.Slf4jLogger [] - Slf4jLogger started
2024-01-12 11:02:15,937 INFO akka.remote.RemoteActorRefProvider [] - Akka Cluster not in use - enabling unsafe features anyway because `akka.remote.use-unsafe-remote-features-outside-cluster` has been enabled.
2024-01-12 11:02:15,938 INFO akka.remote.Remoting [] - Starting remoting
2024-01-12 11:02:16,019 INFO akka.remote.Remoting [] - Remoting started; listening on addresses :[akka.tcp://flink-metrics@100.64.3.40:43773]
2024-01-12 11:02:16,037 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Actor system started at akka.tcp://flink-metrics@100.64.3.40:43773
2024-01-12 11:02:16,118 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.metrics.dump.MetricQueryService at akka://flink-metrics/user/rpc/MetricQueryService_basic-example-taskmanager-1-3 .
2024-01-12 11:02:16,141 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Created BLOB cache storage directory /tmp/tm_basic-example-taskmanager-1-3/blobStorage
2024-01-12 11:02:16,148 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Created BLOB cache storage directory /tmp/tm_basic-example-taskmanager-1-3/blobStorage
2024-01-12 11:02:16,216 INFO org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
2024-01-12 11:02:16,218 INFO org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] - Loading delegation token receivers
2024-01-12 11:02:16,224 INFO org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] - Delegation token receiver hadoopfs loaded and initialized
2024-01-12 11:02:16,225 INFO org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] - Delegation token receiver hbase loaded and initialized
2024-01-12 11:02:16,226 INFO org.apache.flink.core.plugin.DefaultPluginManager [] - Plugin loader with ID found, reusing it: metrics-datadog
2024-01-12 11:02:16,226 INFO org.apache.flink.core.plugin.DefaultPluginManager [] - Plugin loader with ID found, reusing it: metrics-statsd
2024-01-12 11:02:16,226 INFO org.apache.flink.core.plugin.DefaultPluginManager [] - Plugin loader with ID found, reusing it: metrics-slf4j
2024-01-12 11:02:16,226 INFO org.apache.flink.core.plugin.DefaultPluginManager [] - Plugin loader with ID found, reusing it: metrics-graphite
2024-01-12 11:02:16,226 INFO org.apache.flink.core.plugin.DefaultPluginManager [] - Plugin loader with ID found, reusing it: metrics-prometheus
2024-01-12 11:02:16,227 INFO org.apache.flink.core.plugin.DefaultPluginManager [] - Plugin loader with ID found, reusing it: external-resource-gpu
2024-01-12 11:02:16,227 INFO org.apache.flink.core.plugin.DefaultPluginManager [] - Plugin loader with ID found, reusing it: metrics-influx
2024-01-12 11:02:16,227 INFO org.apache.flink.core.plugin.DefaultPluginManager [] - Plugin loader with ID found, reusing it: metrics-jmx
2024-01-12 11:02:16,228 INFO org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] - Delegation token receivers loaded successfully
2024-01-12 11:02:16,228 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Starting TaskManager with ResourceID: basic-example-taskmanager-1-3
2024-01-12 11:02:16,254 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices [] - Temporary file directory '/tmp': total 94 GB, usable 88 GB (93.62% usable)
2024-01-12 11:02:16,258 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager [] - Created a new FileChannelManager for spilling of task related data to disk (joins, sorting, ...). Used directories: /tmp/flink-io-d2303d34-47ac-4a6f-a1dd-bcb08211d531
2024-01-12 11:02:16,312 INFO org.apache.flink.runtime.io.network.netty.NettyConfig [] - NettyConfig [server address: /0.0.0.0, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: AUTO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2024-01-12 11:02:16,438 INFO org.apache.flink.runtime.io.network.NettyShuffleServiceFactory [] - Created a new FileChannelManager for storing result partitions of BLOCKING shuffles. Used directories: /tmp/flink-netty-shuffle-b698edd9-5b87-4f17-9442-2190641af033
2024-01-12 11:02:16,820 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool [] - Allocated 158 MB for network buffer pool (number of memory segments: 5079, bytes per segment: 32768).
2024-01-12 11:02:16,842 INFO org.apache.flink.runtime.io.network.NettyShuffleEnvironment [] - Starting the network environment and its components.
2024-01-12 11:02:17,029 INFO org.apache.flink.runtime.io.network.netty.NettyClient [] - Transport type 'auto': using EPOLL.
2024-01-12 11:02:17,031 INFO org.apache.flink.runtime.io.network.netty.NettyClient [] - Successful initialization (took 188 ms).
2024-01-12 11:02:17,039 INFO org.apache.flink.runtime.io.network.netty.NettyServer [] - Transport type 'auto': using EPOLL.
2024-01-12 11:02:17,141 INFO org.apache.flink.runtime.io.network.netty.NettyServer [] - Successful initialization (took 108 ms). Listening on SocketAddress /0.0.0.0:42335.
2024-01-12 11:02:17,143 INFO org.apache.flink.runtime.taskexecutor.KvStateService [] - Starting the kvState service and its components.
2024-01-12 11:02:17,236 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/rpc/taskmanager_0 .
2024-01-12 11:02:17,342 INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Start job leader service.
2024-01-12 11:02:17,345 INFO org.apache.flink.runtime.filecache.FileCache [] - User file cache uses directory /tmp/flink-dist-cache-3b3d1cb3-3914-4dd5-a403-216680f25c79
2024-01-12 11:02:17,349 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://flink@basic-example.default:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2024-01-12 11:02:27,441 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@basic-example.default:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@basic-example.default:6123/user/rpc/resourcemanager_*.
2024-01-12 11:02:37,538 INFO akka.remote.transport.ProtocolStateActor [] - No response from remote for outbound association. Associate timed out after [20000 ms].
2024-01-12 11:02:37,546 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@basic-example.default:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@basic-example.default:6123]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]
(…)
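
The wget tests at the top of this message only exercise the TCP handshake on port 6123, so the same check can be scripted. Below is a minimal sketch in Python, assuming a Python 3 interpreter is available in the pod (the message does not say whether the Flink image ships one). The function name probe is hypothetical. It reports a refused connection separately from a timeout, since the task manager log shows "connect timed out" and the two outcomes are different symptoms.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome.

    Hypothetical helper for illustration; not part of Flink or the operator.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The peer answered with a reset: host reachable, nothing accepting on that port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout, as in the ConnectionUtils log lines above.
        return "timeout"
    except OSError as exc:
        # Anything else, including DNS resolution failures.
        return f"error: {exc}"
```

Calling probe("basic-example.default", 6123) from the task manager pod and from the job manager pod would give results comparable to the two wget runs; the host name and port are the ones from the service listing above.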