Thanks, Matthias!

Lots of other apps are deployed to this Mesos cluster, and the JobManager
itself is deployed to Mesos via Marathon. In the Mesos log I can see the
JobManager agent starting, but no error messages related to it. As you
say, the TaskManagers don't even have the chance to get confused about
variables, since the JobManager cannot connect to the Mesos master to
tell it to start the TaskManagers.
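
A minimal sketch of a pre-flight check for the entrypoint, assuming a POSIX
shell: it lists which of the variables mentioned in this thread ($HOSTNAME
from my entrypoint, plus the two exports Roman pointed out) are unset or
empty before mesos-appmaster.sh runs. The function name missing_vars is
hypothetical and not part of Flink or Mesos.

```shell
# Hypothetical helper: print the name of every listed environment
# variable that is unset or empty, one per line.
# The body runs in a subshell "( ... )" so its temporaries do not leak.
missing_vars() (
  for name in "$@"; do
    # Indirect lookup of the variable named by $name. Only pass plain
    # variable names here, since the argument goes through eval.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
)

# Possible use at the top of the Docker entrypoint (left as a comment
# so that sourcing this snippet has no side effects):
#   missing=$(missing_vars HOSTNAME HADOOP_CLASSPATH MESOS_NATIVE_JAVA_LIBRARY)
#   [ -z "$missing" ] || { echo "unset: $missing" >&2; exit 1; }
```

This only shows that the variables are set; it does not show that $HOSTNAME
resolves from other hosts or that the libmesos.so path exists.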

Thanks,

Javier

On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <matth...@ververica.com>
wrote:

> Hi Javier,
> I don't see anything that's configured in the wrong way based on the
> jobmanager logs you've provided. Have you been able to deploy other
> applications to this Mesos cluster? Do the Mesos master logs reveal
> anything? The variable resolution on the TaskManager side is a valid
> concern shared by Roman since it's easy to run into such an issue. But the
> JobManager logs indicate that the JobManager is not able to contact the
> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
> not coming up.
>
> Best,
> Matthias
>
> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org>
> wrote:
>
>> Hi,
>>
>> No additional ports need to be open as far as I know.
>>
>> Perhaps $HOSTNAME is being replaced with something that is not
>> resolvable from the TMs?
>>
>> Please also make sure that the following gets executed before
>> mesos-appmaster.sh:
>> export HADOOP_CLASSPATH=$(hadoop classpath)
>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>> (as per the documentation you linked)
>>
>> Regards,
>> Roman
>>
>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jve...@strava.com> wrote:
>> >
>> > I am trying to start Flink 1.13.2 on Mesos following the instructions in
>> > https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>> > and using Marathon to deploy a Docker image containing both the Flink
>> > binaries and my own.
>> >
>> > My entrypoint for the Docker image is:
>> >
>> >
>> > /opt/flink/bin/mesos-appmaster.sh \
>> >       -Djobmanager.rpc.address=$HOSTNAME \
>> >       -Dmesos.resourcemanager.framework.user=flink \
>> >       -Dmesos.master=10.0.18.246:5050 \
>> >       -Dmesos.resourcemanager.tasks.cpus=6
>> >
>> >
>> >
>> > When mesos-appmaster.sh starts, in the stderr I see this:
>> >
>> >
>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker executor on 10.0.20.177
>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>> > WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
>> > WARNING: An illegal reflective access operation has occurred
>> > WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance()
>> > WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
>> > WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
>> > WARNING: All illegal access operations will be denied in a future release
>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected at master@10.0.18.246:5050
>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided. Attempting to register without authentication
>> >
>> >
>> > where the "New master detected" line is promising.
>> >
>> > However, on the Flink UI I see only the jobmanager started, and there
>> are no task managers.  Getting into the Docker container, I see this in the
>> log:
>> >
>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to connect to Mesos; still trying...
>> >
>> >
>> > I have verified that from the container I can access the Mesos
>> > master at 10.0.18.246:5050
>> >
>> >
>> > Does any other port besides the web UI port 5050 need to be open for
>> mesos-appmaster to connect with the Mesos master?
>> >
>> >
>> > In the appmaster log (attached) I see one exception, and I don't know
>> > whether it is related to the Mesos connection problem:
>> >
>> >
>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
>> >         at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>> >         at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>> >         at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>> >         at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>> >         at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>> >         at org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>> >         at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>> >         at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>> >         at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>> >         at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>> >         at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>> >         at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>> >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>> >         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>> >         at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>> >         at org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>> >         at org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>> >         at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>> >
>> >
>> >
>> >
>> > I am not trying (yet) to run in high availability mode, so I am not
>> > sure whether I need to have HADOOP_HOME set, but I don't see anything
>> > about HADOOP_HOME in the Flink docs.
>> >
>> >
>> >
>> > Any tips on how I can fix my Docker+Marathon+Mesos environment so Flink
>> can connect to my Mesos master?
>> >
>> >
>> > Thanks,
>> >
>> >
>> > Javier Vegas
>> >
>> >
>
>
