If it is due to a heartbeat problem and the driver explicitly killed the executors, there should be driver log entries mentioning it, so check the driver log first. The container (executor) logs are also useful: if the container was killed, there will be signal-related log lines such as SIGTERM.
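As a starting sketch of what to grep for (the log paths are hypothetical placeholders for your deployment; the "no recent heartbeats" text should match the message the Spark 1.x driver logs when it expires an executor, and the signal pattern matches the ApplicationMaster line quoted below):

    # Hypothetical log locations -- substitute the real paths on your cluster.
    DRIVER_LOG=/path/to/driver.log
    CONTAINER_LOG=/path/to/container/stderr

    # Driver side: a heartbeat-based kill is logged explicitly,
    # e.g. "Removing executor N with no recent heartbeats: ...".
    grep -iE "no recent heartbeats|Removing executor" "$DRIVER_LOG"

    # Executor/container side: an external kill shows up as a signal message.
    grep -E "RECEIVED SIGNAL|SIGTERM|SIGKILL" "$CONTAINER_LOG"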
On Thu, Mar 3, 2016 at 4:00 PM, Nirav Patel <npa...@xactlycorp.com> wrote:

> There was nothing in the nodemanager logs that indicated why the container was killed.
>
> Here's the guess: since the killed executors were experiencing high GC activity (full GCs) before death, they most likely failed to respond to heartbeats to the driver or nodemanager and got killed because of it.
>
> This is the more relevant line for the issue:
>
> 16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to maprnode5 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
>
> The following has nothing to do with this. It was raised because I manually killed the application at some point, after too many executors were getting killed:
>
> "ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM"
>
> Thanks
>
> On Wed, Mar 2, 2016 at 8:22 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>
>> I think that was due to manually killing the application. ExecutorLost started around 04:46:21 and the application was manually killed around 05:54:41.
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,500 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: *container_1456722312951_0450_01_000001*
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,500 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=xactly IP=10.250.70.119 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1456722312951_0450 CONTAINERID=*container_1456722312951_0450_01_000001*
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,500 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container *container_1456722312951_0450_01_000001* transitioned from RUNNING to KILLING
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,501 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container *container_1456722312951_0450_01_000001*
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,507 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container *container_1456722312951_0450_01_000001* is : 143
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,520 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container *container_1456722312951_0450_01_000001* transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,557 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /tmp/hadoop-xactly/nm-local-dir/usercache/xactly/appcache/application_1456722312951_0450/*container_1456722312951_0450_01_000001*
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,558 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=xactly OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1456722312951_0450 CONTAINERID=*container_1456722312951_0450_01_000001*
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,558 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container *container_1456722312951_0450_01_000001* transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,566 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing *container_1456722312951_0450_01_000001* from application application_1456722312951_0450
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,567 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container *container_1456722312951_0450_01_000001* for log-aggregation
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:41,567 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container *container_1456722312951_0450_01_000001*
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:42,504 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1456722312951_0450_01_000027, container_1456722312951_0450_01_000030, *container_1456722312951_0450_01_000001*]
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01 05:54:42,529 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Uploading logs for container *container_1456722312951_0450_01_000001*. Current good log dirs are /opt/mapr/hadoop/hadoop-2.5.1/logs/userlogs
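One detail worth noting in the nodemanager log above: exit code 143 follows the usual shell convention of 128 + signal number, i.e. 128 + 15 (SIGTERM). That is consistent with a deliberate stop request rather than, say, the kernel OOM-killing the process, which would surface as 137 (128 + SIGKILL, signal 9). A quick shell check of the arithmetic:

    # 143 = 128 + 15; signal 15 is SIGTERM, so the JVM was asked to terminate.
    kill -l 15          # prints: TERM
    echo $((128 + 15))  # prints: 143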
>> On Wed, Mar 2, 2016 at 8:12 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> For some reason the driver was killed. Could you check the node manager log for container_1456722312951_0450_01_000001 on host hdn4.xactlycorporation.local_54259?
>>>
>>> 16/03/01 05:53:27 INFO impl.ContainerManagementProtocolProxy: Opening proxy : hdn10.xactlycorporation.local:38254
>>> 16/03/01 05:54:41 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
>>> 16/03/01 05:54:41 INFO yarn.ApplicationMaster: Final app status: UNDEFINED, exitCode: 0, (reason: Shutdown hook called before final status was reported.)
>>> 16/03/01 05:54:41 INFO util.ShutdownHookManager: Shutdown hook called
>>>
>>> On Wed, Mar 2, 2016 at 10:11 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>
>>>> Removing the spark user group as I don't want to share this publicly.
>>>>
>>>> Two app logs attached, since the job is launched via an oozie spark action.
>>>>
>>>> On Wed, Mar 2, 2016 at 7:37 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> Do you mind attaching the whole yarn app log?
>>>>>
>>>>> On Wed, Mar 2, 2016 at 10:03 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> I searched for "OutOfMemoryError" earlier, and again just now, but it doesn't turn up anything.
>>>>>>
>>>>>> Another thing: the job fails at the "saveAsHadoopDataset" call on a huge RDD, and most of the executors fail at this stage. I don't understand that either, because it should be a straight write to the filesystem; all the complex joins and logic were in previous stages, which all ran successfully.
>>>>>>
>>>>>> Thanks
>>>>>> Nirav
>>>>>>
>>>>>> On Wed, Mar 2, 2016 at 2:22 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote:
>>>>>>
>>>>>>> Could you search "OutOfMemoryError" in the executor logs? It could be "OutOfMemoryError: Direct buffer memory" or something else.
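A concrete way to run that search over the whole application at once is against the aggregated YARN logs; the application id below is taken from the nodemanager log above, and the grep options are standard. A direct-buffer OOM can kill netty's transport threads without any "heap space" message, so it is worth matching broadly:

    # Dump aggregated logs for every container and search them in one pass.
    yarn logs -applicationId application_1456722312951_0450 \
      | grep -n -B 2 -A 10 "OutOfMemoryError"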
>>>>>>>
>>>>>>> On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We are using spark 1.5.2 on yarn. We have a spark application using about 15GB of executor memory and a 1500 MB overhead. At a certain stage we notice high GC time (almost the same as task time), and these executors are bound to get killed at some point. However, the nodemanager and resourcemanager logs don't indicate a failure due to 'beyond physical/virtual memory limits', nor do I see any 'heap space' or 'GC overhead limit exceeded' errors in the executor logs. Some of these high-GC executors get killed eventually, but I can't find the reason. Based on the application logs, it seems the executor didn't respond to the driver for a long period of time and the connection was reset.
>>>>>>>>
>>>>>>>> Following are logs from 'yarn logs -applicationId appId_1232_xxx':
>>>>>>>>
>>>>>>>> 16/02/24 11:09:47 INFO executor.Executor: Finished task 224.0 in stage 8.0 (TID 15318). 2099 bytes result sent to driver
>>>>>>>> 16/02/24 11:09:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15333
>>>>>>>> 16/02/24 11:09:47 INFO executor.Executor: Running task 239.0 in stage 8.0 (TID 15333)
>>>>>>>> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Getting 125 non-empty blocks out of 3007 blocks
>>>>>>>> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Started 14 remote fetches in 10 ms
>>>>>>>> 16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to maprnode5 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
>>>>>>>> 16/02/24 11:11:47 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from maprnode5 is closed
>>>>>>>> 16/02/24 11:11:47 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
>>>>>>>> java.io.IOException: Connection from maprnode5 closed
>>>>>>>>     at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
>>>>>>>>     at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
>>>>>>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>>>>>>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>>>>>>>     at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>>>>>>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>>>>>>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>>>>>>>     at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>>>>>>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>>>>>>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>>>>>>>     at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>>>>>>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>>>>>>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>>>>>>>     at io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
>>>>>>>>     at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
>>>>>>>>     at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>>>>>>>>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>>>>>>>>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>>>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>>>>> 16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 6 outstanding blocks after 5000 ms
>>>>>>>> 16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive connection to maprnode5, creating a new one.
>>>>>>>> 16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in connection from maprnode5
>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>>>>>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>     at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>>>>>>>>     at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>>>>>>>>     at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>>>>>>>>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>>>>>>>>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>>>>>>>>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>>>>>>>>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>>>>>>>>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>>>>>>>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>>>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>>>>> 16/02/24 11:12:16 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from maprnode5 is closed
>>>>>>>> 16/02/24 11:12:16 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
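The "quiet for 120000 ms" error above corresponds to spark.network.timeout at its default of 120s. When long full-GC pauses are suspected of causing missed heartbeats, a common first step is to raise the timeouts and turn on GC logging so the pauses become visible. A sketch, assuming a spark-submit launch and illustrative (not tuned) values:

    spark-submit \
      --conf spark.network.timeout=600s \
      --conf spark.executor.heartbeatInterval=30s \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      ...   # class, jar, and job arguments as before

Keep the heartbeat interval well below the network timeout. Raising the timeout only hides the symptom; the GC log is what shows whether the executor was genuinely stalled.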
>>>>>>>>
>>>>>>>> Also, I can see a lot of ExecutorLost errors in the driver logs, but I can't seem to connect them to the executor logs above. I think executor logs should at least mention the executor ID along with the task ID (EID-TID), not just the task ID (TID).
>>>>>>>>
>>>>>>>> This is a snippet of the driver log from the UI:
>>>>>>>>
>>>>>>>> 189 15283 0 FAILED PROCESS_LOCAL 15 / maprnode5 2016/02/24 11:08:55 / ExecutorLostFailure (executor 15 lost)
>>>>>>>>
>>>>>>>> Here we can see the executor id is 15, but the executor logs themselves don't use this id as a reference in the log stream, so it's hard to cross-check the logs.
>>>>>>>>
>>>>>>>> Anyhow, my main issue is to determine the cause of the executor kills.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Nirav
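On cross-checking executor IDs: one approach that usually works on YARN is to search the aggregated logs for the messages that tie an executor id to a host and container. The exact wording below is an assumption based on Spark 1.x driver/allocator logging, not something quoted in this thread:

    # Pull everything once, then look for lines tying executor 15 to a
    # container/host (e.g. "Registered executor ... with ID 15" from the
    # driver, and container-completion lines from the YARN allocator).
    yarn logs -applicationId application_1456722312951_0450 > app.log
    grep -E "with ID 15|Completed container|maprnode5" app.log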
>>>>>
>>>>> --
>>>>> Best Regards
>>>>>
>>>>> Jeff Zhang
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang