Cong Feng created ZEPPELIN-1055:
-----------------------------------

             Summary: Zeppelin Spark job fails because of a non-fatal ("fake") error
                 Key: ZEPPELIN-1055
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-1055
             Project: Zeppelin
          Issue Type: Bug
    Affects Versions: 0.5.6
         Environment: Zeppelin 0.5.6, Spark 1.6.1, Hadoop 2.7.2; fair scheduler with YARN preemption enabled
            Reporter: Cong Feng
Hi,

Our cluster runs in the environment described above. We run Spark jobs through the Zeppelin UI. Because YARN preemption is enabled, we frequently see the following exceptions in the Spark job, and they cause the job in Zeppelin to be marked as ERROR and stop running:

16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49
java.io.IOException: Failed to send RPC 5721681506291542850 to nodexx.xx.xxxx.ddns.xx.com/xx.xx.xx.xx:42857: java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
    at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException

and

16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122
16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 requests outstanding when connection from nodexx-xx-xx.xxxx.ddns.xx.com/xx.xx.xx.xx:56618 is closed.
We ran the same job through spark-shell and let it get preempted. We saw the same exceptions as above, but spark-shell appears to handle them and keeps the job running, so it drives through to the final result. We therefore think these errors are not fatal to Spark and do not cause the Spark job to fail (at least in our test case). Is there something in Zeppelin that treats them as fatal errors and fails the job?
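For comparison, a minimal sketch of the kind of test we mean is below (illustrative only, not our actual job; sizes and partition counts are arbitrary). It can be pasted into both a Zeppelin paragraph and spark-shell while executors are being preempted; the cached RDDs that fall out of scope each iteration give the ContextCleaner (the source of the "Error cleaning RDD" message above) something to clean up:

// Illustrative Scala snippet; sc is the SparkContext that both Zeppelin and spark-shell provide.
(1 to 20).foreach { i =>
  val rdd = sc.parallelize(1 to 1000000, 200).map(_ + i).cache()
  println(s"iteration $i, count = ${rdd.count()}")  // materialize the cached RDD on the executors
}             // each rdd falls out of scope at the end of its iteration
System.gc()   // encourage the weak-reference-based cleanup that ContextCleaner performs

If spark-shell runs this to completion despite the exceptions while the same paragraph in Zeppelin ends up in ERROR, that would support the suspicion that something on the Zeppelin side treats these cleaner/RPC errors as fatal.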