Could you provide the entire JobManager logs so that we can see what's going on?
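
In case it helps, here is a minimal sketch of how the logs could be collected from the Docker container. It assumes Flink is installed under /opt/flink (as in the entrypoint quoted further down) and writes to its default log directory /opt/flink/log; the function name, the container ID argument and the output directory are placeholders of mine, not anything Flink or Marathon defines.

```shell
#!/bin/sh
# Sketch only: collect JobManager logs from the container running mesos-appmaster.sh.
# Assumptions: Flink lives in /opt/flink and logs to /opt/flink/log (the default
# location); "collect_jm_logs" and its arguments are hypothetical names.
collect_jm_logs() {
  container="$1"              # Docker container ID or name (placeholder)
  out_dir="${2:-./jm-logs}"   # where to store the collected logs
  mkdir -p "$out_dir" || return 1
  # stdout/stderr of the container, which also contains the Mesos native library output
  docker logs "$container" > "$out_dir/container-stdout-stderr.log" 2>&1 || return 1
  # Flink's own log files from the assumed default log directory
  docker cp "$container:/opt/flink/log" "$out_dir/flink-log"
}
```

Running something like `collect_jm_logs <container-id> ./jm-logs` on the agent host should leave both the container output and the Flink log files in one directory that can be attached to a reply.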

On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas <jve...@strava.com> wrote:

> Thanks again, Matthias!
>
> Putting -Djobmanager.rpc.address=$HOST and -Djobmanager.rpc.port=$PORT0 as params for appmaster.sh, I see in the log that they seem to resolve to the correct values:
>
> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>
> But a bit later the appmaster dies with this new error. It is unclear which address it is trying to bind. I added explicit params -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port was somehow interfering, but that didn't help.
>
> 2021-09-29 10:29:59.845 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics java.net.BindException: Cannot assign requested address
>     at java.base/sun.nio.ch.Net.bind0(Native Method)
>     at java.base/sun.nio.ch.Net.bind(Unknown Source)
>     at java.base/sun.nio.ch.Net.bind(Unknown Source)
>     at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>     at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Unknown Source)
>
> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl <matth...@ververica.com> wrote:
>
>> The port has its separate configuration parameter jobmanager.rpc.port [1]
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>
>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas <jve...@strava.com> wrote:
>>
>>> Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address param from $HOSTNAME to $HOST:$PORT0, which in the log I see resolves properly to the host IP and the port mapped to 8081:
>>>
>>> 2021-09-29 07:58:05.452 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djobmanager.rpc.address=10.0.22.114:31894
>>>
>>> which is very promising. But sadly a little bit later the appmaster dies with this error:
>>>
>>> 2021-09-29 07:58:05.648 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services.
>>> 2021-09-29 07:58:05.674 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics org.apache.flink.configuration.IllegalConfigurationException: The configured hostname is not valid
>>>     at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>     at org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>     at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>     at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>>>     at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>     at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>> Caused by: java.lang.IllegalArgumentException
>>>     at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>     at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>     ... 17 more
>>> .
>>> 2021-09-29 07:58:05.685 [main] ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint MesosSessionClusterEntrypoint.
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint MesosSessionClusterEntrypoint.
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>> Caused by: org.apache.flink.configuration.IllegalConfigurationException: The configured hostname is not valid
>>>     at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>     at org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>     at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>     at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>>>     at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>     at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>     ... 2 common frames omitted
>>> Caused by: java.lang.IllegalArgumentException: null
>>>     at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>     at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>     ... 17 common frames omitted
>>>
>>> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl <matth...@ververica.com> wrote:
>>>
>>>> One thing that was puzzling me yesterday when reading your post: Have you tried $HOST instead of $HOSTNAME in the Marathon configuration? When I played around with Mesos, I remember using HOST to resolve the host's IP address instead of the host's name. It could be that the hostname itself cannot be resolved to the right IP address. But I struggled to find proper documentation to back that up. Only in the recipes section of the Marathon docs [1], HOST was used as well.
>>>>
>>>> Matthias
>>>>
>>>> [1] https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>>>
>>>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jve...@strava.com> wrote:
>>>>
>>>>> Another update: Looking more carefully in my appmaster log, I see the following
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Registering as new framework.
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - --------------------------------------------------------------------------------
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos Info:
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Master URL: 10.0.18.246:5050
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Framework Info:
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - ID: (none)
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Name: flink-test
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Failover Timeout (secs): 604800.0
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Role: *
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Capabilities: (none)
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Principal: (none)
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Host: 311dcf7fd77c
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Web UI: http://311dcf7fd77c:8081
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - --------------------------------------------------------------------------------
>>>>>
>>>>> which is picking up the mesos.master and mesos.resourcemanager.framework.name params I am passing to mesos-appmaster.sh
>>>>>
>>>>> In my Mesos dashboard I can see the framework has been created with the right name, but has no associated agents/tasks to it. So at least Flink has been able to connect to the Mesos master to create the framework
>>>>>
>>>>> Later in the mesos-appmaster log is when I see the Mesos connection errors:
>>>>>
>>>>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager - Starting the slot manager.
>>>>> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor - State change (StoppedState -> StoppedState) with data ()
>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator - State change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
>>>>> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.mesos.scheduler.ConnectionMonitor - Connecting to Mesos...
>>>>> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor - State change (StoppedState -> ConnectingState) with data ()
>>>>> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos resource manager started.
>>>>> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG org.apache.flink.mesos.scheduler.LaunchCoordinator - State change (Suspended -> Suspended) with data GatherData(List(),List())
>>>>> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN org.apache.flink.mesos.scheduler.ConnectionMonitor - Unable to connect to Mesos; still trying...
>>>>> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
>>>>> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
>>>>>
>>>>> So why was the appmaster able to connect to the Mesos master to create the framework, but failed to connect later to do whatever it does next?
>>>>>
>>>>> One possible issue I see is that the framework is set with web UI in http://311dcf7fd77c:8081 which can not be resolved from the Mesos master. 311dcf7fd77c is the result of doing hostname on the Docker container, and the Mesos master can not resolve that name. I could try to replace the Docker container hostname with the Docker host hostname, but the host port that gets mapped to 8081 on the container is a random port that I can not know beforehand. Does the Mesos master try to reach Flink using that Web UI setting? Could this be the issue causing my connection problem, or is this a red herring and the problem is a different one?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Javier Vegas
>>>>>
>>>>> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jve...@strava.com> wrote:
>>>>>
>>>>>> Thanks, Matthias!
>>>>>>
>>>>>> There are lots of apps deployed to the Mesos cluster; the task manager itself is deployed to Mesos via Marathon. In the Mesos log I can see the Job manager agent starting, but no error messages related to it. As you say, TaskManagers don't even have the chance to get confused about variables, since the Job Manager can not connect to the Mesos master to tell it to start the Task Managers.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Javier
>>>>>>
>>>>>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <matth...@ververica.com> wrote:
>>>>>>
>>>>>>> Hi Javier,
>>>>>>> I don't see anything that's configured in the wrong way based on the jobmanager logs you've provided. Have you been able to deploy other applications to this Mesos cluster? Do the Mesos master logs reveal anything? The variable resolution on the TaskManager side is a valid concern shared by Roman since it's easy to run into such an issue. But the JobManager logs indicate that the JobManager is not able to contact the Mesos master. Hence, I'd assume that it's not related to the TaskManagers not coming up.
>>>>>>>
>>>>>>> Best,
>>>>>>> Matthias
>>>>>>>
>>>>>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> No additional ports need to be open as far as I know.
>>>>>>>>
>>>>>>>> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>>>>>>>>
>>>>>>>> Please also make sure that the following gets executed before mesos-appmaster.sh:
>>>>>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>>>>>> (as per the documentation you linked)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Roman
>>>>>>>>
>>>>>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jve...@strava.com> wrote:
>>>>>>>> >
>>>>>>>> > I am trying to start Flink 1.13.2 on Mesos following the instructions in https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/ and using Marathon to deploy a Docker image with both Flink and my binaries.
>>>>>>>> >
>>>>>>>> > My entrypoint for the Docker image is:
>>>>>>>> >
>>>>>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>>>>>> >   -Djobmanager.rpc.address=$HOSTNAME \
>>>>>>>> >   -Dmesos.resourcemanager.framework.user=flink \
>>>>>>>> >   -Dmesos.master=10.0.18.246:5050 \
>>>>>>>> >   -Dmesos.resourcemanager.tasks.cpus=6
>>>>>>>> >
>>>>>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>>>>>> >
>>>>>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>>>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>>>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker executor on 10.0.20.177
>>>>>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>>>>>> > WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
>>>>>>>> > WARNING: An illegal reflective access operation has occurred
>>>>>>>> > WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance()
>>>>>>>> > WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>> > WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
>>>>>>>> > WARNING: All illegal access operations will be denied in a future release
>>>>>>>> > I0927 16:50:43.622053 237 sched.cpp:232] Version: 1.7.3
>>>>>>>> > I0927 16:50:43.624439 328 sched.cpp:336] New master detected at master@10.0.18.246:5050
>>>>>>>> > I0927 16:50:43.624779 328 sched.cpp:356] No credentials provided. Attempting to register without authentication
>>>>>>>> >
>>>>>>>> > where the "New master detected" line is promising.
>>>>>>>> >
>>>>>>>> > However, on the Flink UI I see only the jobmanager started, and there are no task managers. Getting into the Docker container, I see this in the log:
>>>>>>>> >
>>>>>>>> > WARN org.apache.flink.mesos.scheduler.ConnectionMonitor - Unable to connect to Mesos; still trying...
>>>>>>>> >
>>>>>>>> > I have verified that from the container I can access the Mesos container 10.0.18.246:5050
>>>>>>>> >
>>>>>>>> > Does any other port besides the web UI port 5050 need to be open for mesos-appmaster to connect with the Mesos master?
>>>>>>>> >
>>>>>>>> > In the appmaster log (attached) I see one exception that may or may not be related to the Mesos connection problem:
>>>>>>>> >
>>>>>>>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
>>>>>>>> >     at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>>>>>> >     at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>>>>>> >     at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>>>>>> >     at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>>>>>> >     at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>>>>>> >     at org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>>>>>> >     at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>>>>>> >     at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>>>>>> >     at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>>>>>> >     at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>>>>>> >     at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>>>>>> >     at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>>>>>> >     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>> >     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>>>>>>> >     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>>>>>> >     at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>>>>>>>> >     at org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>>>>>> >     at org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>>>>>> >     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>>>>>> >
>>>>>>>> > I am not trying (yet) to run in high availability mode, so I am not sure if I need to have HADOOP_HOME set or not, but I don't see anything about HADOOP_HOME in the Flink docs.
>>>>>>>> >
>>>>>>>> > Any tips on how I can fix my Docker+Marathon+Mesos environment so Flink can connect to my Mesos master?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> >
>>>>>>>> > Javier Vegas