What does the log of the failed TaskManager 10.0.88.140 say?

On Thu, Oct 29, 2015 at 6:44 PM, Greg Hogan <c...@greghogan.com> wrote:
> I removed the use of numactl but left in starting two TaskManagers and am
> still seeing TaskManagers crash.
> From the JobManager log:
>
> 17:36:06,412 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.0.88.140:45742] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 17:36:06,567 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (370/2322) (cac9927a8568c2ad79439262a91478af) switched from RUNNING to FAILED
> 17:36:06,572 INFO  org.apache.flink.runtime.jobmanager.JobManager - Status of job 14d946015fd7b35eb801ea6fee5af9e4 (Flink Java Job at Thu Oct 29 17:34:48 UTC 2015) changed to FAILING.
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())' , caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:465)
>     at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
>     at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1089)
>     at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:459)
>     ... 3 more
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
>     at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
>     at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
>     at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
>     at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
>     at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
>     at java.lang.Thread.run(Thread.java:745)
> 17:36:06,587 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (367/2322) (d63c681a18b8164bc24936df1ecb159b) switched from RUNNING to FAILED
>
>
> On Thu, Oct 29, 2015 at 1:00 PM, Stephan Ewen <se...@apache.org> wrote:
>
> > Hi Greg!
> >
> > Interesting... When you say the TaskManagers are dropping, are the
> > TaskManager processes crashing, or are they losing connection to the
> > JobManager?
> >
> > Greetings,
> > Stephan
> >
> >
> > On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan <c...@greghogan.com> wrote:
> >
> > > I recently discovered that AWS uses NUMA for its largest nodes. An
> > > example c4.8xlarge:
> > >
> > > $ numactl --hardware
> > > available: 2 nodes (0-1)
> > > node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
> > > node 0 size: 29813 MB
> > > node 0 free: 24537 MB
> > > node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
> > > node 1 size: 30574 MB
> > > node 1 free: 22757 MB
> > > node distances:
> > > node   0   1
> > >   0:  10  20
> > >   1:  20  10
> > >
> > > I discovered yesterday that Flink performed ~20-30% faster on large
> > > datasets by running two NUMA-constrained TaskManagers per node. The
> > > JobManager node ran a single TaskManager. Resources were divided in
> > > half relative to running a single TaskManager.
> > >
> > > The changes to the tail of bin/taskmanager.sh:
> > >
> > > -"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> > > +numactl --membind=0 --cpunodebind=0 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> > > +numactl --membind=1 --cpunodebind=1 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> > >
> > > After reverting this change the system is again stable. I had not
> > > experienced issues using numactl when running 16 nodes.
> > >
> > > Greg
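
For anyone wanting to reproduce the setup above, here is a minimal sketch of the kind of change Greg describes, generalized to start one NUMA-bound TaskManager per node instead of hard-coding nodes 0 and 1. This is not the exact patch from the thread; it assumes numactl is installed and that FLINK_BIN_DIR, STARTSTOP, and args are already defined by the surrounding bin/taskmanager.sh, as in the stock script:

# Sketch only: replaces the single flink-daemon.sh call at the tail of
# bin/taskmanager.sh with one NUMA-bound call per node.

# Number of NUMA nodes, e.g. "2" on a c4.8xlarge ("available: 2 nodes (0-1)").
NUM_NUMA_NODES=$(numactl --hardware | awk '/^available:/ {print $2}')

for ((node = 0; node < NUM_NUMA_NODES; node++)); do
    # Pin both memory allocation and CPU scheduling to this NUMA node.
    numactl --membind=${node} --cpunodebind=${node} \
        "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
done

As in the original experiment, the per-TaskManager memory and slot settings (e.g. in flink-conf.yaml) would still need to be divided by the node count by hand, since each TaskManager now gets only one NUMA node's share of the machine.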