Thanks Gary, I am compiling a new version of Mesos and when I test it again I will reply here if I found an error.
On Wed, 11 Sep 2019, 09:22 Gary Yao, <g...@ververica.com> wrote: > Hi Felipe, > > I am glad that you were able to fix the problem yourself. > > > But I suppose that Mesos will allocate Slots and Task Managers > dynamically. > > Is that right? > > Yes, that is the case since Flink 1.5 [1]. > > > Then I put the number of "mesos.resourcemanager.tasks.cpus" to be equal > or > > less the available cores on a single node of the cluster. I am not sure > about > > this parameter, but only after this configuration it worked. > > I would need to see JobManager and Mesos logs to understand why this > resolved > your issue. If you do not set mesos.resourcemanager.tasks.cpus explicitly, > Flink will request CPU resources equal to the number of TaskManager slots > (taskmanager.numberOfTaskSlots) [2]. Maybe this value was too high in your > configuration? > > Best, > Gary > > > [1] > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077 > [2] > https://github.com/apache/flink/blob/0a405251b297109fde1f9a155eff14be4d943887/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosTaskManagerParameters.java#L344 > > On Tue, Sep 10, 2019 at 10:41 AM Felipe Gutierrez < > felipe.o.gutier...@gmail.com> wrote: > >> I managed to find what was going wrong. I will write here just for the >> record. >> >> First, the master machine was not login automatically at itself. So I had >> to give permission for it. >> >> chmod og-wx ~/.ssh/authorized_keys >> chmod 750 $HOME >> >> Then I put the number of "mesos.resourcemanager.tasks.cpus" to be equal >> or less the available cores on a single node of the cluster. I am not sure >> about this parameter, but only after this configuration it worked. >> >> Felipe >> *--* >> *-- Felipe Gutierrez* >> >> *-- skype: felipe.o.gutierrez* >> *--* *https://felipeogutierrez.blogspot.com >> <https://felipeogutierrez.blogspot.com>* >> >> >> On Fri, Sep 6, 2019 at 10:36 AM Felipe Gutierrez < >> felipe.o.gutier...@gmail.com> wrote: >> >>> Hi, >>> >>> I am running Mesos without DC/OS [1] and Flink on it. Whe I start my >>> cluster I receive some messages that I suppose everything was started. >>> However, I see 0 slats available on the Flink web dashboard. But I suppose >>> that Mesos will allocate Slots and Task Managers dynamically. Is that right? >>> >>> $ ./bin/mesos-appmaster.sh & >>> [1] 16723 >>> flink@r03:~/flink-1.9.0$ I0906 10:22:45.080328 16943 sched.cpp:239] >>> Version: 1.9.0 >>> I0906 10:22:45.082672 16996 sched.cpp:343] New master detected at >>> mas...@xxx.xxx.xxx.xxx:5050 >>> I0906 10:22:45.083276 16996 sched.cpp:363] No credentials provided. >>> Attempting to register without authentication >>> I0906 10:22:45.086840 16997 sched.cpp:751] Framework registered with >>> 22f6a553-e8ac-42d4-9a90-96a8d5f002f0-0003 >>> >>> Then I deploy my Flink application. When I use the first command to >>> deploy the application starts. However, the tasks remain CREATED until >>> Flink throws a timeout exception. In other words, it never turns to RUNNING. >>> When I use the second comman to deploy the application it does not start >>> and I receive the exception of "Could not allocate all requires slots >>> within timeout of 300000 ms. Slots required: 2". The full stacktrace is >>> below. >>> >>> $ /home/flink/flink-1.9.0/bin/flink run >>> /home/flink/hello-flink-mesos/target/hello-flink-mesos.jar & >>> $ ./bin/mesos-appmaster-job.sh run >>> /home/flink/hello-flink-mesos/target/hello-flink-mesos.jar & >>> >>> >>> [1] >>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/mesos.html#mesos-without-dcos >>> ps.: my application runs normally on a standalone Flink cluster. >>> >>> ------------------------------------------------------------ >>> The program finished with the following exception: >>> >>> org.apache.flink.client.program.ProgramInvocationException: Job failed. >>> (JobID: 7ad8d71faaceb1ac469353452c43dc2a) >>> at >>> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262) >>> at >>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338) >>> at >>> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60) >>> at >>> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507) >>> at org.hello_flink_mesos.App.<init>(App.java:35) >>> at org.hello_flink_mesos.App.main(App.java:285) >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>> at >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >>> at >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>> at java.lang.reflect.Method.invoke(Method.java:498) >>> at >>> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576) >>> at >>> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438) >>> at >>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274) >>> at >>> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746) >>> at >>> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273) >>> at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205) >>> at >>> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010) >>> at >>> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083) >>> at java.security.AccessController.doPrivileged(Native Method) >>> at javax.security.auth.Subject.doAs(Subject.java:422) >>> at >>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) >>> at >>> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) >>> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083) >>> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job >>> execution failed. >>> at >>> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) >>> at >>> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259) >>> ... 22 more >>> Caused by: >>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: >>> Could not allocate all requires slots within timeout of 300000 ms. Slots >>> required: 2, slots allocated: 0, previous allocation IDs: [], execution >>> status: completed exceptionally: java.util.concurrent.CompletionException: >>> java.util.concurrent.CompletionException: >>> java.util.concurrent.TimeoutException/java.util.concurrent.CompletableFuture@b520de3[Completed >>> exceptionally], incomplete: >>> java.util.concurrent.CompletableFuture@36f3d30c[Not >>> completed, 1 dependents] >>> at >>> org.apache.flink.runtime.executiongraph.SchedulingUtils.lambda$scheduleEager$1(SchedulingUtils.java:194) >>> at >>> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) >>> at >>> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) >>> at >>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) >>> at >>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) >>> at >>> org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:633) >>> at >>> org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.lambda$new$0(FutureUtils.java:656) >>> at >>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) >>> at >>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) >>> at >>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) >>> at >>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) >>> at >>> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190) >>> at >>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) >>> at >>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) >>> at >>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) >>> at >>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) >>> at >>> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:700) >>> at >>> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:484) >>> at >>> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:380) >>> at >>> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) >>> at >>> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) >>> at >>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) >>> at >>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) >>> at >>> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:998) >>> at >>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) >>> at >>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) >>> at >>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) >>> at >>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) >>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) >>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) >>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >>> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) >>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) >>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >>> at akka.actor.Actor$class.aroundReceive(Actor.scala:517) >>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) >>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) >>> at akka.actor.ActorCell.invoke(ActorCell.scala:561) >>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) >>> at akka.dispatch.Mailbox.run(Mailbox.scala:225) >>> at akka.dispatch.Mailbox.exec(Mailbox.scala:235) >>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>> at >>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>> at >>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>> >>> Thanks, >>> Felipe >>> *--* >>> *-- Felipe Gutierrez* >>> >>> *-- skype: felipe.o.gutierrez* >>> *--* *https://felipeogutierrez.blogspot.com >>> <https://felipeogutierrez.blogspot.com>* >>> >>