Unfortunately the stack trace was swallowed by the java Timer in the LocalInputChannel [1]; the real error is forwarded out to the main thread, but I couldn't figure out how to see it in my logs.
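To make that failure mode concrete, here is a simplified sketch of the retry pattern I mean. This is not Flink's actual LocalInputChannel code; the class and method names (RetrySketch, retriggerRequest, requestPartition) are illustrative, though I believe the real channel forwards the failure to the consuming task through a similar setError/checkError pair, which is why the timer thread never logs the original stack trace.

    import java.io.IOException;
    import java.util.Timer;
    import java.util.TimerTask;
    import java.util.concurrent.atomic.AtomicReference;

    // Sketch only: the TimerTask catches the failure and hands it to the
    // consuming thread instead of logging it, so the timer thread's log is
    // empty and the original stack trace only surfaces wherever the consumer
    // eventually rethrows it.
    public class RetrySketch {

        private final Timer retryTimer = new Timer("partition-request-retry");
        private final AtomicReference<Throwable> error = new AtomicReference<>();

        void retriggerRequest(long delayMillis) {
            retryTimer.schedule(new TimerTask() {
                @Override
                public void run() {
                    try {
                        requestPartition(); // may fail, e.g. partition not found
                    } catch (Throwable t) {
                        // captured here, but not logged on the timer thread
                        error.compareAndSet(null, t);
                    }
                }
            }, delayMillis);
        }

        // called from the task's main thread
        void checkError() throws IOException {
            Throwable t = error.get();
            if (t != null) {
                throw new IOException("Partition request failed", t);
            }
        }

        private void requestPartition() throws IOException {
            // placeholder for the actual partition request
            throw new IOException("PartitionNotFoundException (placeholder)");
        }
    }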
However, I believe I am close to having a reproducible example: run a Flink 1.4 DataStream job sinking to Kafka 0.11 and cancel it with a savepoint. If you then shut down the Kafka broker daemon on a single broker but keep the REST proxy up, you should see this error when you resume from the savepoint. (A minimal sketch of such a job is included at the bottom of this message, after the quoted thread.)

[1] https://github.com/apache/flink/blob/fa024726bb801fc71cec5cc303cac1d4a03f555e/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/LocalInputChannel.java#L151

Seth Wiesman | Software Engineer
4 World Trade Center, 46th Floor, New York, NY 10007
swies...@mediamath.com

From: Fabian Hueske <fhue...@gmail.com>
Date: Tuesday, March 13, 2018 at 8:02 PM
To: Seth Wiesman <swies...@mediamath.com>, Stefan Richter <s.rich...@data-artisans.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: PartitionNotFoundException when restarting from checkpoint

Hi Seth,

Thanks for sharing how you resolved the problem!

The problem might have been related to Flink's key groups, which are used to assign key ranges to tasks. I'm not sure why this would be related to ZooKeeper being in a bad state. Maybe Stefan (in CC) has an idea about the cause.

Also, it would be helpful if you could share the stack trace of the exception (in case you still have it).

Best, Fabian

2018-03-13 14:35 GMT+01:00 Seth Wiesman <swies...@mediamath.com>:

It turns out the issue was due to our ZooKeeper installation being in a bad state. I am not clear enough on Flink's networking internals to explain how this manifested as a PartitionNotFoundException, but hopefully this can serve as a starting point for others who run into the same issue.

From: Seth Wiesman <swies...@mediamath.com>
Date: Friday, March 9, 2018 at 11:53 AM
To: "user@flink.apache.org" <user@flink.apache.org>
Subject: PartitionNotFoundException when restarting from checkpoint

Hi,

We are running Flink 1.4.0 on a YARN deployment on EC2 instances with RocksDB and incremental checkpointing. Last night a job failed and became stuck in a restart cycle with a PartitionNotFoundException. We tried restoring from the checkpoint on a fresh Flink session with no luck. Looking through the logs we can see that the specified partition is never registered with the ResultPartitionManager. My questions are:

1) Are partitions part of state, or are they ephemeral to the job?
2) If they are not part of state, where would the task managers be getting that partition id to begin with?
3) Right now we are logging everything under org.apache.flink.runtime.io.network; is there anywhere else to look?

Thank you,
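For reference, a minimal job along the lines of the repro described above might look roughly like the following. The broker address, topic name, and savepoint paths are placeholders, and it assumes the flink-connector-kafka-0.11 dependency on Flink 1.4; treat it as a sketch rather than the exact job we run.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class KafkaRestoreRepro {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(10_000); // checkpoints so a savepoint/restore path exists

            // trivial unbounded source so the job keeps running until cancelled
            env.addSource(new SourceFunction<String>() {
                private volatile boolean running = true;

                @Override
                public void run(SourceContext<String> ctx) throws Exception {
                    long i = 0;
                    while (running) {
                        synchronized (ctx.getCheckpointLock()) {
                            ctx.collect("event-" + i++);
                        }
                        Thread.sleep(100);
                    }
                }

                @Override
                public void cancel() {
                    running = false;
                }
            })
            // placeholder broker and topic; sinks to Kafka 0.11
            .addSink(new FlinkKafkaProducer011<>("broker-1:9092", "repro-topic", new SimpleStringSchema()));

            env.execute("kafka-011-restore-repro");
        }
    }

The operational steps would then be, roughly: submit the job, cancel it with a savepoint (e.g. "flink cancel -s <savepointDir> <jobId>"), stop the Kafka broker daemon on one node while leaving the REST proxy running, and resume from the savepoint with "flink run -s <savepointPath> <jar>".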