Cong Feng created ZEPPELIN-1055:
-----------------------------------

             Summary: Zeppelin Spark job fails because of non-fatal error
                 Key: ZEPPELIN-1055
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-1055
             Project: Zeppelin
          Issue Type: Bug
    Affects Versions: 0.5.6
         Environment: zeppelin 0.5.6, Spark 1.6.1 and Hadoop 2.7.2, fair scheduler and YARN preemption enabled.
            Reporter: Cong Feng


Hi,

Our cluster runs in the environment described above, and we submit Spark jobs through the Zeppelin UI. With YARN preemption enabled, we frequently see the following exceptions in the Spark job, and they cause the job in Zeppelin to be marked as an error and stop running:

16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49
java.io.IOException: Failed to send RPC 5721681506291542850 to nodexx.xx.xxxx.ddns.xx.com/xx.xx.xx.xx:42857: java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
    at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException

And:
16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122
16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 requests outstanding when connection from nodexx-xx-xx.xxxx.ddns.xx.com/xx.xx.xx.xx:56618 is closed.

We ran the same job through the Spark shell and let it get preempted. We saw the same exceptions as above, but the Spark shell appears to handle them and keeps the job running, eventually producing the final result. So we believe these errors are not fatal to Spark and do not by themselves cause the Spark job to fail (at least in our test case). Is there something in Zeppelin that treats them as fatal errors and fails the job?
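
For illustration only (this is not our actual notebook code, just a minimal sketch), a workload along the following lines, run either as a Zeppelin %spark paragraph or in spark-shell, keeps RDD-cleanup RPCs flowing to the executors, so an executor lost to preemption can surface the ClosedChannelException shown above. It assumes the sc provided by spark-shell or the Zeppelin Spark interpreter on Spark 1.6.1; the data sizes and loop count are arbitrary:

// Hypothetical sketch, not our production job. Assumes `sc` from spark-shell
// or the Zeppelin Spark interpreter (Spark 1.6.1 / Scala 2.10).
val data = sc.parallelize(1 to 10000000, 100)

for (i <- 1 to 50) {
  // Cache an intermediate RDD, use it, then unpersist it. unpersist() sends
  // RemoveRdd messages to the executors' block managers, the same kind of
  // cleanup RPC the ContextCleaner issues, so an executor preempted by YARN
  // mid-cleanup produces the "Failed to send RPC ..." error above.
  val sums = data.map(x => (x % 1000, x.toLong)).reduceByKey(_ + _).cache()
  println(s"iteration $i: ${sums.count()} keys")
  sums.unpersist()
}

Letting cached RDDs simply go out of scope (and be garbage collected) exercises the ContextCleaner path from the first trace directly; the explicit unpersist() in the sketch just makes the cleanup deterministic.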





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)