I think the problem was that i was using local filesystem in a cluster. Now I have switched to hdfs.
Thanks, Arpit On Sun, May 29, 2016 at 12:57 PM, Aljoscha Krettek <aljos...@apache.org> wrote: > Hi, > could you please provide the code of your user function that has the > Checkpointed interface and is keeping state? This might give people a > chance of understanding what is going on. > > Cheers, > Aljoscha > > On Sat, 28 May 2016 at 20:55 arpit srivastava <arpit8...@gmail.com> wrote: > >> Hi, >> >> I am using Flink on yarn cluster. My job was running for 2-3 days. After >> that it failed with two errors >> >> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: >> Error at remote task manager 'ip-xx.xx.xx.xxx'. >> at >> org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.decodeMsg(PartitionRequestClientHandler.java:241) >> at >> org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelRead(PartitionRequestClientHandler.java:164) >> at >> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) >> at >> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) >> at >> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) >> at >> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) >> at >> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) >> at >> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242) >> at >> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) >> at >> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) >> at >> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847) >> at >> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) >> at >> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) >> at >> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) >> at >> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) >> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) >> at >> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) >> at java.lang.Thread.run(Thread.java:745) >> Caused by: >> org.apache.flink.runtime.io.network.partition.ProducerFailedException >> at >> org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.writeAndFlushNextMessageIfPossible(PartitionRequestQueue.java:164) >> at >> org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.userEventTriggered(PartitionRequestQueue.java:96) >> at >> io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:308) >> at >> io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:294) >> at >> io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:108) >> at >> io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:308) >> at >> io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:294) >> at >> io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:108) >> at >> io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:308) >> at >> io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:294) >> at >> io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:108) >> at >> io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:308) >> at >> io.netty.channel.AbstractChannelHandlerContext.access$500(AbstractChannelHandlerContext.java:32) >> at >> io.netty.channel.AbstractChannelHandlerContext$6.run(AbstractChannelHandlerContext.java:299) >> at >> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) >> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) >> >> multiple instances of "checkpoint not found exception" >> >> java.lang.Exception: Could not restore checkpointed state to operators >> and functions >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:454) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:209) >> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) >> at java.lang.Thread.run(Thread.java:745) >> Caused by: java.io.FileNotFoundException: >> /mnt/tmp/ce/flink/jit/checkpoint/1363f00ad9c261070babe30b822e8e61/chk-2862/04bb3654-645c-4a0f-93c8-9582140613d2 >> (No such file or directory) >> at java.io.FileInputStream.open0(Native Method) >> at java.io.FileInputStream.open(FileInputStream.java:195) >> at java.io.FileInputStream.<init>(FileInputStream.java:138) >> at >> org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:52) >> at >> org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143) >> at >> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:51) >> at >> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35) >> at >> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:446) >> >> >> Already there are multiple checkpoint files there and directory is >> present. >> >> Before this one of task manager had memory usage of more than 90% and >> that's the ip of remote exception. >> >> Can anybody faced something similar. >> >> Thanks, >> Arpit >> >