Hi Dongwon, I am not familiar with the deployment on DC/OS. However, Eron Wright and Jörg Schad (cc'd), who have worked on the Mesos integration, might be able to help you.
Best, Gary On Tue, Jan 9, 2018 at 10:29 AM, Dongwon Kim <eastcirc...@gmail.com> wrote: > Hi, > > I've launched JobManager and TaskManager on DC/OS successfully. > Now I have two new issues: > > 1) All TaskManagers are scheduled on a single node. > - Is it intended to maximize data locality and minimize network > communication cost? > - Is there an option in Flink to adjust the behavior of JobManager when it > considers multiple resource offers from different Mesos agents? > - I want to schedule TaskManager processes on different GPU servers so > that each TaskManger process can use its own GPU cards exclusively. > - Below is a part of JobManager log that is occurring while JobManager is > negotiating resources with the Mesos master: > > 2018-01-09 07:34:54,872 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosJobManager - JobManager > akka.tcp://flink@dnn-g08-233:18026/user/jobmanager was granted leadership > with leader session ID Some(00000000-0000-0000-0000-000000000000). > 2018-01-09 07:34:55,889 INFO > org.apache.flink.mesos.scheduler.ConnectionMonitor - Connecting to > Mesos... > 2018-01-09 07:34:55,962 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Trying to associate with JobManager leader > akka.tcp://flink@dnn-g08-233:18026/user/jobmanager > 2018-01-09 07:34:55,977 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Resource Manager associating with leading JobManager > Actor[akka://flink/user/jobmanager#-1481183359] - leader session > 00000000-0000-0000-0000-000000000000 > 2018-01-09 07:34:56,479 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Scheduling Mesos task taskmanager-00001 with (10240.0 MB, 8.0 cpus). > 2018-01-09 07:34:56,481 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Scheduling Mesos task taskmanager-00002 with (10240.0 MB, 8.0 cpus). > 2018-01-09 07:34:56,481 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Scheduling Mesos task taskmanager-00003 with (10240.0 MB, 8.0 cpus). > 2018-01-09 07:34:56,481 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Scheduling Mesos task taskmanager-00004 with (10240.0 MB, 8.0 cpus). > 2018-01-09 07:34:56,481 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Scheduling Mesos task taskmanager-00005 with (10240.0 MB, 8.0 cpus). > 2018-01-09 07:34:56,483 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - Now gathering > offers for at least 5 task(s). > 2018-01-09 07:34:56,484 INFO > org.apache.flink.mesos.scheduler.ConnectionMonitor - Connected to > Mesos as framework ID 59b85b42-a4a2-4632-9578-9e480585ecdc-0004. > 2018-01-09 07:34:56,690 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - Received > offer(s) of 606170.0 MB, 234.2 cpus: > 2018-01-09 07:34:56,692 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 59b85b42-a4a2-4632-9578-9e480585ecdc-O2174 from 50.1.100.233 of 111186.0 MB, > 45.9 cpus for [*] > 2018-01-09 07:34:56,692 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 59b85b42-a4a2-4632-9578-9e480585ecdc-O2175 from 50.1.100.235 of 123506.0 MB, > 47.3 cpus for [*] > 2018-01-09 07:34:56,692 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 59b85b42-a4a2-4632-9578-9e480585ecdc-O2176 from 50.1.100.234 of 124530.0 MB, > 46.6 cpus for [*] > 2018-01-09 07:34:56,692 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 59b85b42-a4a2-4632-9578-9e480585ecdc-O2177 from 50.1.100.231 of 123474.0 MB, > 47.2 cpus for [*] > 2018-01-09 07:34:56,693 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 59b85b42-a4a2-4632-9578-9e480585ecdc-O2178 from 50.1.100.232 of 123474.0 MB, > 47.2 cpus for [*] > 2018-01-09 07:34:57,711 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - Processing 5 > task(s) against 5 new offer(s) plus outstanding offers. > 2018-01-09 07:34:57,726 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - Resources > considered: (note: expired offers not deducted from below) > 2018-01-09 07:34:57,727 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 50.1.100.234 has 124530.0 MB, 46.6 cpus > 2018-01-09 07:34:57,728 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 50.1.100.235 has 123506.0 MB, 47.3 cpus > 2018-01-09 07:34:57,728 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 50.1.100.232 has 123474.0 MB, 47.2 cpus > 2018-01-09 07:34:57,728 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 50.1.100.233 has 111186.0 MB, 45.9 cpus > 2018-01-09 07:34:57,728 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 50.1.100.231 has 123474.0 MB, 47.2 cpus > 2018-01-09 07:34:58,069 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Launching Mesos task taskmanager-00005 on host 50.1.100.231. > 2018-01-09 07:34:58,069 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - Launched 5 > task(s) on 50.1.100.231 using 1 offer(s): > 2018-01-09 07:34:58,070 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Launching Mesos task taskmanager-00002 on host 50.1.100.231. > 2018-01-09 07:34:58,070 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Launching Mesos task taskmanager-00003 on host 50.1.100.231. > 2018-01-09 07:34:58,070 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Launching Mesos task taskmanager-00004 on host 50.1.100.231. > 2018-01-09 07:34:58,070 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > Launching Mesos task taskmanager-00001 on host 50.1.100.231. > 2018-01-09 07:34:58,070 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - > 59b85b42-a4a2-4632-9578-9e480585ecdc-O2177 > 2018-01-09 07:34:58,071 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - No longer > gathering offers; all requests fulfilled. > 2018-01-09 07:34:58,072 INFO com.netflix.fenzo.TaskScheduler > - Expiring all leases > 2018-01-09 07:34:58,072 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - Declined > offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2176 from 50.1.100.234 of > 124530.0 MB, 46.6 cpus. > 2018-01-09 07:34:58,073 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - Declined > offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2175 from 50.1.100.235 of > 123506.0 MB, 47.3 cpus. > 2018-01-09 07:34:58,073 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - Declined > offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2178 from 50.1.100.232 of > 123474.0 MB, 47.2 cpus. > 2018-01-09 07:34:58,074 INFO > org.apache.flink.mesos.scheduler.LaunchCoordinator - Declined > offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2174 from 50.1.100.233 of > 111186.0 MB, 45.9 cpus. > 2018-01-09 07:35:05,868 INFO org.apache.flink.mesos.scheduler.TaskMonitor > - Mesos task taskmanager-00005 is running. > 2018-01-09 07:35:06,103 INFO org.apache.flink.mesos.scheduler.TaskMonitor > - Mesos task taskmanager-00001 is running. > 2018-01-09 07:35:06,111 INFO org.apache.flink.mesos.scheduler.TaskMonitor > - Mesos task taskmanager-00004 is running. > 2018-01-09 07:35:06,116 INFO org.apache.flink.mesos.scheduler.TaskMonitor > - Mesos task taskmanager-00002 is running. > 2018-01-09 07:35:06,119 INFO org.apache.flink.mesos.scheduler.TaskMonitor > - Mesos task taskmanager-00003 is running. > 2018-01-09 07:35:14,377 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > TaskManager taskmanager-00003 has started. > 2018-01-09 07:35:14,380 INFO > org.apache.flink.runtime.instance.InstanceManager - Registered > TaskManager at DNN-G08-231 > (akka.tcp://flink@dnn-g08-231:1027/user/taskmanager) as > b94277c8ad550eeef5364947e4330c00. Current number of registered hosts is 1. > Current number of alive task slots is 8. > 2018-01-09 07:35:14,389 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > TaskManager taskmanager-00004 has started. > 2018-01-09 07:35:14,389 INFO > org.apache.flink.runtime.instance.InstanceManager - Registered > TaskManager at DNN-G08-231 > (akka.tcp://flink@dnn-g08-231:1033/user/taskmanager) as > e0183a5317b331b90496049b1893c922. Current number of registered hosts is 2. > Current number of alive task slots is 16. > 2018-01-09 07:35:14,462 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > TaskManager taskmanager-00001 has started. > 2018-01-09 07:35:14,462 INFO > org.apache.flink.runtime.instance.InstanceManager - Registered > TaskManager at DNN-G08-231 > (akka.tcp://flink@dnn-g08-231:1029/user/taskmanager) as > 8d85b49d4118514552fcad3b98fef3e2. Current number of registered hosts is 3. > Current number of alive task slots is 24. > 2018-01-09 07:35:14,465 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > TaskManager taskmanager-00005 has started. > 2018-01-09 07:35:14,465 INFO > org.apache.flink.runtime.instance.InstanceManager - Registered > TaskManager at DNN-G08-231 > (akka.tcp://flink@dnn-g08-231:1031/user/taskmanager) as > b740607fb2e88bcfc275498bb54ed9fd. Current number of registered hosts is 4. > Current number of alive task slots is 32. > 2018-01-09 07:35:14,560 INFO > org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - > TaskManager taskmanager-00002 has started. > 2018-01-09 07:35:14,560 INFO > org.apache.flink.runtime.instance.InstanceManager - Registered > TaskManager at DNN-G08-231 > (akka.tcp://flink@dnn-g08-231:1025/user/taskmanager) as > 95433440f37ea1790e7ef9309f110fe4. Current number of registered hosts is 5. > Current number of alive task slots is 40. > > > > 2) After the TaskManagers are started, the following lines are repeated in > the JobManage log every second: > > 2018-01-09 07:36:51,080 ERROR > org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - > Caught exception > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447) > at > org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) > at java.lang.Thread.run(Thread.java:748) > 2018-01-09 07:37:43,600 ERROR > org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - > Caught exception > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447) > at > org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) > at java.lang.Thread.run(Thread.java:748) > 2018-01-09 07:38:43,619 ERROR > org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - > Caught exception > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447) > at > org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) > at java.lang.Thread.run(Thread.java:748) > 2018-01-09 07:39:43,630 ERROR > org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - > Caught exception > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447) > at > org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) > at java.lang.Thread.run(Thread.java:748) > > - Can I ignore this exception? or there's something I should fix up? > > Best, > > - Dongwon > >