Ok, thanks for the update! Let me know if you run into any more problems.
On Mon, 17 Oct 2016 at 14:40 <robert.lancas...@hyatt.com> wrote: > HI Aljoscha, > > > > Thanks for the response. > > > > To answer your question, the base path did not exist. But, I think I > found the issue. I believe I had some rogue task managers running. As a > troubleshooting step, I attempted to restart my cluster. However, after > shutting down the cluster I noticed that there were still task managers > running on most of my nodes (and on the master). Interestingly, on a > second attempt to shut down the cluster, I received the message “No > taskmanager daemon to stop on host…” for each of my nodes, even though I > could see the flink processes running on these machines. After manually > killing these processes and restarting the cluster, the problem went away. > > > > So, my assumption is that on a previous attempt to bounce the cluster, > these processes did not shut down cleanly. Starting the cluster after that > **may** have resulted in second instances of the task manager running on > most nodes. I’m not certain, however, and I haven’t yet been able to > reproduce the issue. > > > > > > > > > > > > *From: *Aljoscha Krettek <aljos...@apache.org> > *Reply-To: *"user@flink.apache.org" <user@flink.apache.org> > *Date: *Friday, October 14, 2016 at 6:57 PM > *To: *"user@flink.apache.org" <user@flink.apache.org> > *Subject: *Re: job failure with checkpointing enabled > > > > Hi, > > the file that Flink is trying to create there is not meant to be in the > checkpointing location. It is a local file that is used for buffering > elements until a checkpoint barrier arrives (for certain cases). Can you > check whether the base path where it is trying to create that file exists? > For the exception that you posted that would be: > /tmp/flink-io-202fdf67-3f8c-47dd-8ebc-2265430644ed > > > > Cheers, > > Aljoscha > > > > On Fri, 14 Oct 2016 at 17:37 <robert.lancas...@hyatt.com> wrote: > > I recently tried enabling checkpointing in a job (that previously works > w/o checkpointing) and received the following failure on job execution: > > > > java.io.FileNotFoundException: > /tmp/flink-io-202fdf67-3f8c-47dd-8ebc-2265430644ed/a426eb27761575b3b79e464719bba96e16a1869d85bae292a2ef7eb72fa8a14c.0.buffer > (No such file or directory) > > at java.io.RandomAccessFile.open0(Native Method) > > at java.io.RandomAccessFile.open(RandomAccessFile.java:316) > > at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243) > > at > org.apache.flink.streaming.runtime.io.BufferSpiller.createSpillingChannel(BufferSpiller.java:247) > > at > org.apache.flink.streaming.runtime.io.BufferSpiller.<init>(BufferSpiller.java:117) > > at > org.apache.flink.streaming.runtime.io.BarrierBuffer.<init>(BarrierBuffer.java:94) > > at > org.apache.flink.streaming.runtime.io.StreamInputProcessor.<init>(StreamInputProcessor.java:96) > > at > org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.init(OneInputStreamTask.java:49) > > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:239) > > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584) > > at java.lang.Thread.run(Thread.java:745) > > > > > > The job then restarts and fails again in an endless cycle. > > > > This feels like a configuration issue. My guess is that Flink is looking > for the file above on local storage, though we’ve configured checkpointing > to use hdfs (see below). > > > > To enable checkpointing, this is what I did: > > env.enableCheckpointing(3000l); > > > > Relevant configurations in flink-conf.yaml: > > state.backend: filesystem > > state.backend.fs.checkpointdir: > hdfs://myhadoopnamenode:8020/apps/flink/checkpoints > > > > Note, the directory we’ve configured is not the same as the path indicated > in the error. > > > > Interestingly, there are plenty of subdirs in my checkpoints directory, > these appear to correspond to job start times, even though these jobs don’t > have checkpointing enabled: > > drwxr-xr-x - rtap hdfs 0 2016-10-13 07:48 > /apps/flink/checkpoints/b4870565f148cff10478dca8bff27bf7 > > drwxr-xr-x - rtap hdfs 0 2016-10-13 08:27 > /apps/flink/checkpoints/044b21a0f252b6142e7ddfee7bfbd7d5 > > drwxr-xr-x - rtap hdfs 0 2016-10-13 08:36 > /apps/flink/checkpoints/a658b23c2d2adf982a2cf317bfb3d3de > > drwxr-xr-x - rtap hdfs 0 2016-10-14 07:38 > /apps/flink/checkpoints/1156bd1796105ad95a8625cb28a0b816 > > drwxr-xr-x - rtap hdfs 0 2016-10-14 07:41 > /apps/flink/checkpoints/58fdd94b7836a3b3ed9abc5c8f3a1dd5 > > drwxr-xr-x - rtap hdfs 0 2016-10-14 07:43 > /apps/flink/checkpoints/47a849a8ed6538b9e7d3826a628d38b9 > > drwxr-xr-x - rtap hdfs 0 2016-10-14 07:49 > /apps/flink/checkpoints/e6a9e2300ea5c36341fa160adab789f0 > > > > Thanks! > > > > > > > ------------------------------ > > The information contained in this communication is confidential and > intended only for the use of the recipient named above, and may be legally > privileged and exempt from disclosure under applicable law. If the reader > of this message is not the intended recipient, you are hereby notified that > any dissemination, distribution or copying of this communication is > strictly prohibited. If you have received this communication in error, > please resend it to the sender and delete the original message and copy of > it from your computer system. Opinions, conclusions and other information > in this message that do not relate to our official business should be > understood as neither given nor endorsed by the company. > > > ------------------------------ > The information contained in this communication is confidential and > intended only for the use of the recipient named above, and may be legally > privileged and exempt from disclosure under applicable law. If the reader > of this message is not the intended recipient, you are hereby notified that > any dissemination, distribution or copying of this communication is > strictly prohibited. If you have received this communication in error, > please resend it to the sender and delete the original message and copy of > it from your computer system. Opinions, conclusions and other information > in this message that do not relate to our official business should be > understood as neither given nor endorsed by the company. >