Thanks for sharing. I was wondering why your command does not use $PORT0. Also, are the ports properly configured in the Marathon network configuration [1]? That said, the error seems to be unrelated to that setting. Other than that, I cannot see any issue with the configuration. Could it be that the HOST IP is blocked?
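In case it helps narrow down the BindException: "Cannot assign requested address" is what a bind typically returns when no local interface owns the requested address. This is only a guess on my side, since nothing in the logs shows which address is being bound. If the container runs with bridge networking, the value of $HOST (the agent's IP) may not exist inside the container. Below is a minimal sketch of a check that could be run inside the container. The helper names are made up for illustration, and it assumes a Linux image where `hostname -I` is available and that Marathon exports HOST.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flink or Marathon): succeeds when the
# address in $1 appears in the whitespace-separated address list in $2.
is_local_address() {
    target=$1
    for addr in $2; do
        [ "$addr" = "$target" ] && return 0
    done
    return 1
}

# Hypothetical helper: prints whether $1 is one of this machine's own
# addresses. Assumes `hostname -I` exists (Linux); if it does not, the
# list is empty and the address is reported as not local.
check_bind_address() {
    local_addrs=$(hostname -I 2>/dev/null || true)
    if is_local_address "$1" "$local_addrs"; then
        echo "$1 is assigned to a local interface"
    else
        echo "$1 is NOT local (local addresses: ${local_addrs:-unknown})"
    fi
}

# Only run the check when HOST is set (assumed to be exported by Marathon).
if [ -n "${HOST:-}" ]; then
    check_bind_address "$HOST"
fi
```

If the check reports that $HOST is not local, the address mismatch is worth pursuing; if it is local, the cause is more likely elsewhere.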
[1] https://mesosphere.github.io/marathon/docs/ports.html#specifying-ports

On Wed, Sep 29, 2021 at 7:07 PM Javier Vegas <jve...@strava.com> wrote:
>
> Full appmaster log in debug mode is attached.
> My startup command was
> /opt/flink/bin/mesos-appmaster.sh \
>   -Drest.bind-port=8081 \
>   -Drest.port=8081 \
>   -Djobmanager.rpc.address=$HOST \
>   -Djobmanager.rpc.port=$PORT1 \
>   -Dmesos.resourcemanager.framework.user=flink \
>   -Dmesos.resourcemanager.framework.name=timeline-flink-populator \
>   -Dmesos.master=10.0.18.246:5050 \
>   -Dmesos.resourcemanager.tasks.cpus=4 \
>   -Dmesos.resourcemanager.tasks.container.type=docker \
>   -Dmesos.resourcemanager.tasks.container.image.name=docker.strava.com/strava/timeline-populator2:jv-mesos \
>   -Dtaskmanager.numberOfTaskSlots=4 ;
>
> where $PORT1 refers to my second host open port, mapped to 6123 on the
> Docker container (first port is mapped to 8081).
> I can see in the log that $HOST and $PORT1 resolve to the correct values,
> 10.0.20.25 and 31608
>
> On Wed, Sep 29, 2021 at 9:41 AM Matthias Pohl <matth...@ververica.com> wrote:
>
>> ...and if possible, it would be helpful to provide debug logs as well.
>>
>> On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl <matth...@ververica.com> wrote:
>>
>>> Could you provide the entire JobManager logs so that we can see what's
>>> going on?
>>>
>>> On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas <jve...@strava.com> wrote:
>>>
>>>> Thanks again, Matthias!
>>>>
>>>> Putting -Djobmanager.rpc.address=$HOST and
>>>> -Djobmanager.rpc.port=$PORT0 as params for appmaster.sh
>>>> I see in the log they seem to resolve to the correct values
>>>>
>>>> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>>>>
>>>> but a bit later the appmaster dies with this new error. It is unclear
>>>> what address it is trying to bind. I added explicit params
>>>> -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port
>>>> was somehow interfering, but that didn't help.
>>>>
>>>> 2021-09-29 10:29:59.845 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics java.net.BindException: Cannot assign requested address
>>>>     at java.base/sun.nio.ch.Net.bind0(Native Method)
>>>>     at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>>>     at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>>>     at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>>>>     at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>>>     at java.base/java.lang.Thread.run(Unknown Source)
>>>>
>>>> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl <matth...@ververica.com> wrote:
>>>>
>>>>> The port has its separate configuration parameter jobmanager.rpc.port [1]
>>>>>
>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>>>>
>>>>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas <jve...@strava.com> wrote:
>>>>>
>>>>>> Matthias, thanks for the suggestion! I changed my
>>>>>> jobmanager.rpc.address param from $HOSTNAME to $HOST:$PORT0 which in the
>>>>>> log I see resolves properly to the host IP and port mapped to 8081
>>>>>>
>>>>>> 2021-09-29 07:58:05.452 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djobmanager.rpc.address=10.0.22.114:31894
>>>>>>
>>>>>> which is very promising. But sadly a little bit later appmaster dies
>>>>>> with this error:
>>>>>>
>>>>>> 2021-09-29 07:58:05.648 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services.
>>>>>> 2021-09-29 07:58:05.674 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting MesosSessionClusterEntrypoint down with application status FAILED.
>>>>>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException: The configured hostname is not valid
>>>>>>     at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>>>>     at org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>>>>     at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>>>>     at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>>>>     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>>>>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>>>>>>     at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>>>>     at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>>>>     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>>>> Caused by: java.lang.IllegalArgumentException
>>>>>>     at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>>>>     at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>>>>     ... 17 more
>>>>>> .
>>>>>> 2021-09-29 07:58:05.685 [main] ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint MesosSessionClusterEntrypoint.
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint MesosSessionClusterEntrypoint.
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>>>>     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>>>> Caused by: org.apache.flink.configuration.IllegalConfigurationException: The configured hostname is not valid
>>>>>>     at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>>>>     at org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>>>>     at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>>>>     at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>>>>     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>>>>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>>>>>>     at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>>>>     at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>>>>     ... 2 common frames omitted
>>>>>> Caused by: java.lang.IllegalArgumentException: null
>>>>>>     at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>>>>     at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>>>>     ... 17 common frames omitted
>>>>>>
>>>>>> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl <matth...@ververica.com> wrote:
>>>>>>
>>>>>>> One thing that was puzzling me yesterday when reading your post:
>>>>>>> Have you tried $HOST instead of $HOSTNAME in the Marathon configuration?
>>>>>>> When I played around with Mesos, I remember using HOST to resolve the
>>>>>>> host's IP address instead of the host's name.
>>>>>>> It could be that the hostname itself cannot be resolved to the right
>>>>>>> IP address. But I struggled to find proper documentation to back that
>>>>>>> up. Only in the recipes section of the Marathon docs [1], HOST was
>>>>>>> used as well.
>>>>>>>
>>>>>>> Matthias
>>>>>>>
>>>>>>> [1] https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>>>>>>
>>>>>>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jve...@strava.com> wrote:
>>>>>>>
>>>>>>>> Another update: Looking more carefully in my appmaster log, I see
>>>>>>>> the following
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Registering as new framework.
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - --------------------------------------------------------------------------------
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos Info:
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Master URL: 10.0.18.246:5050
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Framework Info:
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - ID: (none)
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Name: flink-test
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Failover Timeout (secs): 604800.0
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Role: *
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Capabilities: (none)
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Principal: (none)
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Host: 311dcf7fd77c
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Web UI: http://311dcf7fd77c:8081
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - --------------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> which is picking up the mesos.master and
>>>>>>>> mesos.resourcemanager.framework.name params I am passing to
>>>>>>>> mesos-appmaster.sh
>>>>>>>>
>>>>>>>> In my Mesos dashboard I can see the framework has been created with
>>>>>>>> the right name, but it has no agents/tasks associated with it. So at
>>>>>>>> least Flink has been able to connect to the Mesos master to create
>>>>>>>> the framework
>>>>>>>>
>>>>>>>> Later in the mesos-appmaster log is when I see the Mesos connection
>>>>>>>> errors:
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager - Starting the slot manager.
>>>>>>>> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor - State change (StoppedState -> StoppedState) with data ()
>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator - State change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
>>>>>>>> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.mesos.scheduler.ConnectionMonitor - Connecting to Mesos...
>>>>>>>> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor - State change (StoppedState -> ConnectingState) with data ()
>>>>>>>> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos resource manager started.
>>>>>>>> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG org.apache.flink.mesos.scheduler.LaunchCoordinator - State change (Suspended -> Suspended) with data GatherData(List(),List())
>>>>>>>> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN org.apache.flink.mesos.scheduler.ConnectionMonitor - Unable to connect to Mesos; still trying...
>>>>>>>> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
>>>>>>>> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
>>>>>>>>
>>>>>>>> So why was the appmaster able to connect to the Mesos master to create
>>>>>>>> the framework but then failed to connect later to do whatever it does
>>>>>>>> later?
>>>>>>>>
>>>>>>>> One possible issue I see is that the framework is set with web UI
>>>>>>>> in http://311dcf7fd77c:8081 which can not be resolved from the
>>>>>>>> Mesos master. 311dcf7fd77c is the result of doing hostname on the
>>>>>>>> Docker container, and the Mesos master can not resolve that name. I
>>>>>>>> could try to replace the Docker container hostname with the Docker
>>>>>>>> host hostname, but the host port that gets mapped to 8081 on the
>>>>>>>> container is a random port that I can not know beforehand. Does Mesos
>>>>>>>> master try to reach Flink using that Web UI setting? Could this be
>>>>>>>> the issue causing my connection problem, or is this a red herring and
>>>>>>>> the problem is a different one?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Javier Vegas
>>>>>>>>
>>>>>>>> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jve...@strava.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Matthias!
>>>>>>>>>
>>>>>>>>> There are lots of apps deployed to the Mesos cluster, the task
>>>>>>>>> manager itself is deployed to Mesos via Marathon. In the Mesos log I
>>>>>>>>> can see the Job manager agent starting, but no error messages related
>>>>>>>>> to it. As you say, TaskManagers don't even have the chance to get
>>>>>>>>> confused about variables, since the Job Manager can not connect to
>>>>>>>>> the Mesos master to tell it to start the Task Managers.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Javier
>>>>>>>>>
>>>>>>>>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <matth...@ververica.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Javier,
>>>>>>>>>> I don't see anything that's configured in the wrong way based on
>>>>>>>>>> the jobmanager logs you've provided. Have you been able to deploy
>>>>>>>>>> other applications to this Mesos cluster? Do the Mesos master logs
>>>>>>>>>> reveal anything? The variable resolution on the TaskManager side is
>>>>>>>>>> a valid concern shared by Roman since it's easy to run into such an
>>>>>>>>>> issue. But the JobManager logs indicate that the JobManager is not
>>>>>>>>>> able to contact the Mesos master. Hence, I'd assume that it's not
>>>>>>>>>> related to the TaskManagers not coming up.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Matthias
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> No additional ports need to be open as far as I know.
>>>>>>>>>>>
>>>>>>>>>>> Probably, $HOSTNAME is substituted for something not resolvable
>>>>>>>>>>> on TMs?
>>>>>>>>>>>
>>>>>>>>>>> Please also make sure that the following gets executed before
>>>>>>>>>>> mesos-appmaster.sh:
>>>>>>>>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>>>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>>>>>>>>> (as per the documentation you linked)
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Roman
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jve...@strava.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > I am trying to start Flink 1.13.2 on Mesos following the
>>>>>>>>>>> > instructions in
>>>>>>>>>>> > https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>>>>>>>>> > and using Marathon to deploy a Docker image with both the Flink
>>>>>>>>>>> > and my binaries.
>>>>>>>>>>> >
>>>>>>>>>>> > My entrypoint for the Docker image is:
>>>>>>>>>>> >
>>>>>>>>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>>>>>>>>> >   -Djobmanager.rpc.address=$HOSTNAME \
>>>>>>>>>>> >   -Dmesos.resourcemanager.framework.user=flink \
>>>>>>>>>>> >   -Dmesos.master=10.0.18.246:5050 \
>>>>>>>>>>> >   -Dmesos.resourcemanager.tasks.cpus=6
>>>>>>>>>>> >
>>>>>>>>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>>>>>>>>> >
>>>>>>>>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>>>>>>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>>>>>>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker executor on 10.0.20.177
>>>>>>>>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>>>>>>>>> > WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
>>>>>>>>>>> > WARNING: An illegal reflective access operation has occurred
>>>>>>>>>>> > WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance()
>>>>>>>>>>> > WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>>>>> > WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
>>>>>>>>>>> > WARNING: All illegal access operations will be denied in a future release
>>>>>>>>>>> > I0927 16:50:43.622053 237 sched.cpp:232] Version: 1.7.3
>>>>>>>>>>> > I0927 16:50:43.624439 328 sched.cpp:336] New master detected at master@10.0.18.246:5050
>>>>>>>>>>> > I0927 16:50:43.624779 328 sched.cpp:356] No credentials provided. Attempting to register without authentication
>>>>>>>>>>> >
>>>>>>>>>>> > where the "New master detected" line is promising.
>>>>>>>>>>> >
>>>>>>>>>>> > However, on the Flink UI I see only the jobmanager started,
>>>>>>>>>>> > and there are no task managers. Getting into the Docker
>>>>>>>>>>> > container, I see this in the log:
>>>>>>>>>>> >
>>>>>>>>>>> > WARN org.apache.flink.mesos.scheduler.ConnectionMonitor - Unable to connect to Mesos; still trying...
>>>>>>>>>>> >
>>>>>>>>>>> > I have verified that from the container I can access the Mesos
>>>>>>>>>>> > container 10.0.18.246:5050
>>>>>>>>>>> >
>>>>>>>>>>> > Does any other port besides the web UI port 5050 need to be
>>>>>>>>>>> > open for mesos-appmaster to connect with the Mesos master?
>>>>>>>>>>> >
>>>>>>>>>>> > In the appmaster log (attached) I see one exception, and I
>>>>>>>>>>> > don't know if it is related to the Mesos connection problem:
>>>>>>>>>>> >
>>>>>>>>>>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
>>>>>>>>>>> >     at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>>>>>>>>> >     at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>>>>>>>>> >     at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>>>>>>>>> >     at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>>>>>>>>> >     at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>>>>>>>>> >     at org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>>>>>>>>> >     at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>>>>>>>>> >     at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>>>>>>>>> >     at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>>>>>>>>> >     at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>>>>>>>>> >     at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>>>>>>>>> >     at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>>>>>>>>> >     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>> >     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>>>>>>>>>> >     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>>>>>>>>> >     at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>>>>>>>>>>> >     at org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>>>>>>>>> >     at org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>>>>>>>>> >     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>>>>>>>>> >
>>>>>>>>>>> > I am not trying (yet) to run in high availability mode, so I
>>>>>>>>>>> > am not sure if I need to have HADOOP_HOME set or not, but I
>>>>>>>>>>> > don't see anything about HADOOP_HOME in the Flink docs.
>>>>>>>>>>> >
>>>>>>>>>>> > Any tips on how I can fix my Docker+Marathon+Mesos environment
>>>>>>>>>>> > so Flink can connect to my Mesos master?
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks,
>>>>>>>>>>> >
>>>>>>>>>>> > Javier Vegas