Re: Flink -mesos-app master hang

Till Rohrmann Fri, 04 Aug 2017 03:18:39 -0700

Hi Biswajit,

are there any Mesos logs which might help us pinpoint the problem? I've
actually never run Flink on Mesos with Docker images, but it could be that
Flink does not set things up properly for running Docker images. I'll try
to run Flink based on Docker images over the weekend in order to see
whether I can reproduce the problem.
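
While collecting those logs, it can help to tally the failure reasons that
the Flink resource manager already reports. The following is a minimal
Python sketch, not part of Flink or Mesos; the regular expression is an
assumption modelled on the "Diagnostics for task" lines quoted further down
in this thread, and `summarize_failures` is a hypothetical helper name. It
expects unwrapped log lines as they appear in the log file on disk.

```python
import re
from collections import Counter

# Assumed shape of a diagnostics line, modelled on the excerpts in this thread:
#   ... Diagnostics for task <id> in state <STATE> : reason=<REASON> message=<text>
DIAG = re.compile(
    r"Diagnostics for task (?P<task>\S+) in state (?P<state>\S+)\s*:\s*"
    r"reason=(?P<reason>\S+)\s+message=(?P<message>.*)"
)


def summarize_failures(lines):
    """Count (reason, message) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = DIAG.search(line)
        if match:
            key = (match.group("reason"), match.group("message").strip())
            counts[key] += 1
    return counts
```

Calling `summarize_failures(open(path))` on the JobManager log (the path
depends on your deployment) gives one counter entry per distinct reason and
message, which shows quickly whether every TaskManager launch fails the
same way.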


Cheers,
Till

On Wed, Aug 2, 2017 at 8:48 PM, Biswajit Das <biswajit...@gmail.com> wrote:

> Hi There,
>
> I posted this here in the group a few days back, and since then I have
> been exchanging email with Eron; thanks to Eron for all the tips. Now I
> see this basic auth error. I'm a little confused about how the Job Manager
> launched fine while the task manager fails to authenticate.
> Also, the Mesos docs say authenticate is false by default, so it should not
> have gone down that path. Do I have to disable it somewhere inside Flink? I
> don't see any config or property for it in the code.
>
> This is kind of a blocker for my Mesos deployment now; I would really
> appreciate any input or suggestions.
>
> ~ Biswajit
>
> ---------- Forwarded message ----------
> From: Eron Wright <ewri...@live.com>
> Date: Wed, Aug 2, 2017 at 10:51 AM
> ------------------------------
> *From:* Biswajit Das <biswajit...@gmail.com>
> *Sent:* Wednesday, August 2, 2017 10:19:45 AM
> *To:* Eron Wright
> *Subject:* Re: Flink -mesos-app master hang
>
> Hi Eron,
>
> Good morning. I'm really sorry for flooding you with questions. I'll post
> this one to the user group as well.
> I could narrow down the actual error thrown by Mesos; it seems the JM is
> somehow not able to authenticate. I'm a little confused whether it is a *docker
> private registry TLS error* or something else. I have even started the slave
> with --docker_config; previously I was mostly using docker.tar.gz
> with the container for private repo authentication.
>
> 017-08-02 03:32:54,163 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00003 failed unexpectedly.
> 2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  *- Mesos task taskmanager-00003 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"')*
> 2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-00003 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"'
> 2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Total number of failed tasks so far: 3
> 2017-08-02 03:32:54,164 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
> 2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
> 2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down and unregistering as a Mesos framework.
> 2017-08-02 03:32:54,171 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource master
> root@ip-172-31-4-44:/etc/me
>
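
The message in this log names the challenge the registry sent:
`Basic realm="Registry Realm"`. A registry that speaks the Docker Registry
HTTP API v2 answers an unauthenticated `GET /v2/` with a `WWW-Authenticate`
header, so one way to confirm what the agent is seeing is to request that
endpoint directly. A minimal Python sketch follows; `parse_challenge` and
`probe_registry` are hypothetical helper names, the parsing only handles
simple `key="value"` parameters, and which schemes the Mesos 1.1.0 image
puller accepts is not established in this thread.

```python
import re
import urllib.error
import urllib.request


def parse_challenge(header):
    """Split a WWW-Authenticate value into (scheme, {param: value})."""
    scheme, _, rest = header.strip().partition(" ")
    # Only quoted key="value" parameters are recognised (a simplification).
    params = dict(re.findall(r'(\w+)="([^"]*)"', rest))
    return scheme, params


def probe_registry(base_url):
    """Request <base_url>/v2/ and return the parsed challenge, or None
    if the registry sends no WWW-Authenticate header."""
    url = base_url.rstrip("/") + "/v2/"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            header = response.headers.get("WWW-Authenticate")
    except urllib.error.HTTPError as err:
        # A 401 response is expected when the registry requires auth.
        header = err.headers.get("WWW-Authenticate")
    return parse_challenge(header) if header else None
```

For the header in the log, `parse_challenge` yields the scheme `Basic` and
the realm `Registry Realm`. A registry with a self-signed certificate would
additionally need its CA configured before `probe_registry` can connect.
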
> On Tue, Aug 1, 2017 at 1:53 PM, Eron Wright <ewri...@live.com> wrote:
>
>> I think you're on the right track in trying to configure the docker
>> image provider.  This is on Linux, right, and you definitely restarted the
>> agents?
>>
>>
>> An important difference between the JM and the TM is that the JM is a
>> task launched by the Marathon framework, whereas the TM is a task launched
>> by the JM framework.  The respective configurations and behaviors are
>> different.   For example, I see that Marathon is launching the JM with the
>> Docker containerizer, whereas the JM is launching the TM with the Mesos
>> containerizer (with Docker image provider support).     The Mesos
>> containerizer is more modern and preferred, and I don't think Flink
>> supports anything else.
>>
>>
>> The doc I linked to shows how to launch a docker image-based container
>> with mesos-execute.   Using mesos-execute to verify your cluster
>> configuration is a good idea, to isolate any issue.  For example, see if
>> you can launch a container using the Mesos containerizer and the Docker
>> image provider, executing a simple command such as 'sleep'.
>>
>>
>> Eron
>> ------------------------------
>> *From:* Biswajit Das <biswajit...@gmail.com>
>> *Sent:* Tuesday, August 1, 2017 10:02:51 AM
>> *To:* Eron Wright
>>
>> *Subject:* Re: Flink -mesos-app master hang
>>
>> Hi Eron,
>>
>> Thank you for the email; I really appreciate your reply.
>>
>> That's what is confusing me. I have been running Mesos with containers
>> on both staging and production for almost a year now, mostly with
>> Spark/Presto load, everything containerized, on a fairly big cluster. Here is
>> one of my slave configs. One interesting part is that the app master is
>> launched and I can access the job manager web UI from the Mesos framework; I can
>> also see that it has registered itself as the `flink` framework. The only problem
>> I'm seeing is that the task manager count shows `0`, although I asked for 2 instances.
>>
>>
>> /usr/sbin/mesos-slave --master=zk://XXX/mesos --log_dir=/var/log/mesos
>> --attributes=environment:dev;agent_role:generic
>> *--containerizers=docker,mesos* --executor_registration_timeout=10mins
>> --hostname=XXX
>> *--image_providers=appc,docker --ip=XXX --isolation=filesystem/linux,docker/runtime*
>> --resources=ports(*):[0-65535] --work_dir=/var/lib/mesos
>>
>>
>> Previously I never had *--image_providers and --isolation*; after
>> seeing this error I added these two, but it did not help much. I'm running
>> Mesos 1.1.0 on Ubuntu and submitting the job with Marathon.
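
As a cross-check of the agent flags, here is a minimal Python sketch that
reports which settings for image-based containers are absent from an agent
command line. The required values (`mesos` containerizer, `docker` image
provider, `filesystem/linux` and `docker/runtime` isolators) follow the
Mesos container image documentation linked further down in this thread, as
far as I understand it, and should be treated as an assumption;
`missing_settings` is a hypothetical helper, not a Mesos tool.

```python
# Assumed requirements for launching Docker images with the Mesos
# containerizer; verify against the Mesos documentation for your version.
REQUIRED = {
    "containerizers": {"mesos"},
    "image_providers": {"docker"},
    "isolation": {"filesystem/linux", "docker/runtime"},
}


def parse_flags(command_line):
    """Map each --name=value token of a command line to {name: value}."""
    flags = {}
    for token in command_line.split():
        if token.startswith("--") and "=" in token:
            name, _, value = token[2:].partition("=")
            flags[name] = value
    return flags


def missing_settings(command_line):
    """Return {flag: [missing values]} for every unmet requirement."""
    flags = parse_flags(command_line)
    missing = {}
    for name, needed in REQUIRED.items():
        present = set(flags.get(name, "").split(","))
        gap = needed - present
        if gap:
            missing[name] = sorted(gap)
    return missing
```

An empty result only says the flags are present on the command line; the
agent still has to be restarted for them to take effect.
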
>>
>>
>> I have tried toggling the Mesos debug log, but there is not much info other
>> than a signal to kill the framework.
>>
>> marathon json task
>>
>>> {
>>>   "id": "/flink-app-master",
>>>   "cmd": null,
>>>   "cpus": 2,
>>>   "mem": 4096,
>>>   "disk": 10000,
>>>   "instances": 1,
>>>   "constraints": [
>>>     [
>>>       "hostname",
>>>       "LIKE",
>>>       "xxx" ->>> restricted to some host for debugging as I have a fairly
>>> big cluster
>>>     ]
>>>   ],
>>>   "acceptedResourceRoles": [
>>>     "*"
>>>   ],
>>>   "container": {
>>>     "type": "DOCKER",
>>>     "volumes": [],
>>>     "docker": {
>>>       "image": "docker.xx.xx/flink:1.8.0",
>>>       "network": "HOST",
>>>       "portMappings": [],
>>>       "privileged": false,
>>>       "parameters": [],
>>>       "forcePullImage": false
>>>     }
>>>   },
>>>   "env": {
>>>     "MESOS_MASTER": "zk://XX/mesos"
>>>   },
>>>   "portDefinitions": [
>>>     {
>>>       "port": 9081,
>>>       "protocol": "tcp",
>>>       "name": "default",
>>>       "labels": {}
>>>     }
>>>   ],
>>>   "uris": [
>>>     "file:///etc/docker.tar.gz"
>>>   ],
>>>   "fetch": [
>>>     {
>>>       "uri": "file:///etc/docker.tar.gz",
>>>       "extract": true,
>>>       "executable": false,
>>>       "cache": false
>>>     }
>>>   ]
>>> }
>>>
>>
>> On Tue, Aug 1, 2017 at 7:22 AM, Eron Wright <ewri...@live.com> wrote:
>>
>>> From the error message it seems that your Mesos cluster doesn't have the
>>> docker image provisioner installed.   The message originates from Mesos
>>> anyway so the problem lies there.   Note that docker image support is
>>> provided in Linux only.  You can also use the Flink on Mesos support
>>> without images, if you make sure that JAVA_HOME is set on all executors.
>>>
>>> Hope this helps!
>>>
>>> http://mesos.apache.org/documentation/latest/container-image/
>>>
>>>
>>>
>>> From: Biswajit Das
>>> Sent: Tuesday, August 1, 1:24 AM
>>> Subject: Re: Flink -mesos-app master hang
>>> To: ewri...@live.com
>>>
>>>
>>> Hi Eron, I came across some of your comments in JIRA and wanted to
>>> clarify this ^^. I'm kind of clueless at this point; any pointers on where
>>> to look?
>>>
>>>
>>> -----------------------------------------------
>>> 2017-08-01 07:26:34,688 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator            - Waiting for more offers; 1 task(s) are not yet launched.
>>> 2017-08-01 07:26:34,717 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Launching Mesos task taskmanager-00039 on host 172.31.5.212.
>>> 2017-08-01 07:26:34,731 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00039 failed unexpectedly.
>>> *2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00039 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unsupported container image type: DOCKER)*
>>> 2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-00039 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Unsupported container image type: DOCKER
>>> 2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Total number of failed tasks so far: 3
>>> 2017-08-01 07:26:34,734 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>>> 2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>>> 2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down and unregistering as a Mesos framework.
>>> 2017-08-01 07:26:34,745 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource master
>>> 2017-08-01 07:26:34,745 INFO  org.apache.f
>>> ---------------------------------------------------
>>>
>>> Thank you in advance.
>>> ~Biswajit
>>>
>>> On Sun, Jul 30, 2017 at 12:42 PM, Biswajit Das <biswajit...@gmail.com>
>>> wrote:
>>>
>>> Hi All,
>>> I'm trying to run a Flink Docker image from Marathon with the Mesos app
>>> master; I can see it go into a continuous loop, failing to launch the
>>> task manager. If I go to the Mesos master UI, I can see the job manager web
>>> UI showing zero task managers.
>>>
>>> I have checked pretty much every possible log, starting from the Ubuntu
>>> machine's docker.log and the Mesos master/slave logs, and there is pretty much
>>> no information other than the failed task. I could see the log below in Flink.
>>> However, I'm able to run the same Docker image if I run the jobmanager and
>>> taskmanager by themselves in Marathon and let them connect via the jobmanager
>>> RPC port.
>>>
>>> For the Mesos config, I'm using the settings below from the YAML file:
>>> mesos.master: ${MESOS_MASTER}
>>> mesos.failover-timeout: 60
>>> mesos.initial-tasks: ${INITIAL_TASK_MANAGERS}
>>> mesos.resourcemanager.tasks.mem: ${RESOURCEMANAGER_TASKS_MEM:-4096}
>>> mesos.resourcemanager.tasks.cpus: ${RESOURCEMANAGER_TASKS_CPU:-1}
>>> mesos.resourcemanager.tasks.container.type: docker
>>> mesos.resourcemanager.tasks.container.image.name: ${IMAGE_NAME}
>>>
>>> ---------------------------
>>> 07-30 02:05:48,351 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00002 failed unexpectedly.
>>> 2017-07-30 02:05:48,352 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00002 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_COMMAND_EXECUTOR_FAILED (Container exited with status 127)
>>> -----------------------------------------------------
>>>
>>> Please let me know if anyone has any pointers for debugging this further.
>>>
>>>
>>> ~ Biswajit
>>>
>>>
>>>
>>>
>>>
>>
>
>
