Hi, I've launched JobManager and TaskManager on DC/OS successfully. Now I have two new issues:
1) All TaskManagers are scheduled on a single node. - Is it intended to maximize data locality and minimize network communication cost? - Is there an option in Flink to adjust the behavior of JobManager when it considers multiple resource offers from different Mesos agents? - I want to schedule TaskManager processes on different GPU servers so that each TaskManger process can use its own GPU cards exclusively. - Below is a part of JobManager log that is occurring while JobManager is negotiating resources with the Mesos master: 2018-01-09 07:34:54,872 INFO org.apache.flink.mesos.runtime.clusterframework.MesosJobManager - JobManager akka.tcp://flink@dnn-g08-233:18026/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000). 2018-01-09 07:34:55,889 INFO org.apache.flink.mesos.scheduler.ConnectionMonitor - Connecting to Mesos... 2018-01-09 07:34:55,962 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Trying to associate with JobManager leader akka.tcp://flink@dnn-g08-233:18026/user/jobmanager 2018-01-09 07:34:55,977 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#-1481183359] - leader session 00000000-0000-0000-0000-000000000000 2018-01-09 07:34:56,479 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Scheduling Mesos task taskmanager-00001 with (10240.0 MB, 8.0 cpus). 2018-01-09 07:34:56,481 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Scheduling Mesos task taskmanager-00002 with (10240.0 MB, 8.0 cpus). 2018-01-09 07:34:56,481 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Scheduling Mesos task taskmanager-00003 with (10240.0 MB, 8.0 cpus). 2018-01-09 07:34:56,481 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Scheduling Mesos task taskmanager-00004 with (10240.0 MB, 8.0 cpus). 2018-01-09 07:34:56,481 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Scheduling Mesos task taskmanager-00005 with (10240.0 MB, 8.0 cpus). 2018-01-09 07:34:56,483 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Now gathering offers for at least 5 task(s). 2018-01-09 07:34:56,484 INFO org.apache.flink.mesos.scheduler.ConnectionMonitor - Connected to Mesos as framework ID 59b85b42-a4a2-4632-9578-9e480585ecdc-0004. 2018-01-09 07:34:56,690 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Received offer(s) of 606170.0 MB, 234.2 cpus: 2018-01-09 07:34:56,692 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 59b85b42-a4a2-4632-9578-9e480585ecdc-O2174 from 50.1.100.233 of 111186.0 MB, 45.9 cpus for [*] 2018-01-09 07:34:56,692 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 59b85b42-a4a2-4632-9578-9e480585ecdc-O2175 from 50.1.100.235 of 123506.0 MB, 47.3 cpus for [*] 2018-01-09 07:34:56,692 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 59b85b42-a4a2-4632-9578-9e480585ecdc-O2176 from 50.1.100.234 of 124530.0 MB, 46.6 cpus for [*] 2018-01-09 07:34:56,692 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 59b85b42-a4a2-4632-9578-9e480585ecdc-O2177 from 50.1.100.231 of 123474.0 MB, 47.2 cpus for [*] 2018-01-09 07:34:56,693 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 59b85b42-a4a2-4632-9578-9e480585ecdc-O2178 from 50.1.100.232 of 123474.0 MB, 47.2 cpus for [*] 2018-01-09 07:34:57,711 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Processing 5 task(s) against 5 new offer(s) plus outstanding offers. 2018-01-09 07:34:57,726 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Resources considered: (note: expired offers not deducted from below) 2018-01-09 07:34:57,727 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 50.1.100.234 has 124530.0 MB, 46.6 cpus 2018-01-09 07:34:57,728 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 50.1.100.235 has 123506.0 MB, 47.3 cpus 2018-01-09 07:34:57,728 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 50.1.100.232 has 123474.0 MB, 47.2 cpus 2018-01-09 07:34:57,728 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 50.1.100.233 has 111186.0 MB, 45.9 cpus 2018-01-09 07:34:57,728 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 50.1.100.231 has 123474.0 MB, 47.2 cpus 2018-01-09 07:34:58,069 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00005 on host 50.1.100.231. 2018-01-09 07:34:58,069 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Launched 5 task(s) on 50.1.100.231 using 1 offer(s): 2018-01-09 07:34:58,070 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00002 on host 50.1.100.231. 2018-01-09 07:34:58,070 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00003 on host 50.1.100.231. 2018-01-09 07:34:58,070 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00004 on host 50.1.100.231. 2018-01-09 07:34:58,070 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00001 on host 50.1.100.231. 2018-01-09 07:34:58,070 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - 59b85b42-a4a2-4632-9578-9e480585ecdc-O2177 2018-01-09 07:34:58,071 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - No longer gathering offers; all requests fulfilled. 2018-01-09 07:34:58,072 INFO com.netflix.fenzo.TaskScheduler - Expiring all leases 2018-01-09 07:34:58,072 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Declined offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2176 from 50.1.100.234 of 124530.0 MB, 46.6 cpus. 2018-01-09 07:34:58,073 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Declined offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2175 from 50.1.100.235 of 123506.0 MB, 47.3 cpus. 2018-01-09 07:34:58,073 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Declined offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2178 from 50.1.100.232 of 123474.0 MB, 47.2 cpus. 2018-01-09 07:34:58,074 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Declined offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2174 from 50.1.100.233 of 111186.0 MB, 45.9 cpus. 2018-01-09 07:35:05,868 INFO org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00005 is running. 2018-01-09 07:35:06,103 INFO org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00001 is running. 2018-01-09 07:35:06,111 INFO org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00004 is running. 2018-01-09 07:35:06,116 INFO org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00002 is running. 2018-01-09 07:35:06,119 INFO org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00003 is running. 2018-01-09 07:35:14,377 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - TaskManager taskmanager-00003 has started. 2018-01-09 07:35:14,380 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at DNN-G08-231 (akka.tcp://flink@dnn-g08-231:1027/user/taskmanager) as b94277c8ad550eeef5364947e4330c00. Current number of registered hosts is 1. Current number of alive task slots is 8. 2018-01-09 07:35:14,389 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - TaskManager taskmanager-00004 has started. 2018-01-09 07:35:14,389 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at DNN-G08-231 (akka.tcp://flink@dnn-g08-231:1033/user/taskmanager) as e0183a5317b331b90496049b1893c922. Current number of registered hosts is 2. Current number of alive task slots is 16. 2018-01-09 07:35:14,462 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - TaskManager taskmanager-00001 has started. 2018-01-09 07:35:14,462 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at DNN-G08-231 (akka.tcp://flink@dnn-g08-231:1029/user/taskmanager) as 8d85b49d4118514552fcad3b98fef3e2. Current number of registered hosts is 3. Current number of alive task slots is 24. 2018-01-09 07:35:14,465 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - TaskManager taskmanager-00005 has started. 2018-01-09 07:35:14,465 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at DNN-G08-231 (akka.tcp://flink@dnn-g08-231:1031/user/taskmanager) as b740607fb2e88bcfc275498bb54ed9fd. Current number of registered hosts is 4. Current number of alive task slots is 32. 2018-01-09 07:35:14,560 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - TaskManager taskmanager-00002 has started. 2018-01-09 07:35:14,560 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at DNN-G08-231 (akka.tcp://flink@dnn-g08-231:1025/user/taskmanager) as 95433440f37ea1790e7ef9309f110fe4. Current number of registered hosts is 5. Current number of alive task slots is 40. 2) After the TaskManagers are started, the following lines are repeated in the JobManage log every second: 2018-01-09 07:36:51,080 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - Caught exception java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447) at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:748) 2018-01-09 07:37:43,600 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - Caught exception java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447) at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:748) 2018-01-09 07:38:43,619 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - Caught exception java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447) at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:748) 2018-01-09 07:39:43,630 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - Caught exception java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447) at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:748) - Can I ignore this exception? or there's something I should fix up? Best, - Dongwon