Re: [FLINK-14076] non-deserializable root cause in DeclineCheckpoint

2019-09-23 Thread Piotr Nowojski
Hi, I guess the TaskManager should have logged the original exception somewhere (I’m not saying that we shouldn’t solve this, just to make sure that the basics are covered), so you should already be able to deduce the reason of failure, right? I think that option 2. would not only be easier, b

Re: [FLINK-14076] non-deserializable root cause in DeclineCheckpoint

2019-09-23 Thread Till Rohrmann
Hi Jeffrey, thanks for reporting this issue and starting a discussion how to solve this problem. I've pulled in Piotr who is working on the checkpointing part of Flink. If a user generated exception can get reported, then we need to make sure that it is properly handled. Approach 2. would be easi

Re: non-deserializable root cause in DeclineCheckpoint

2019-09-22 Thread Jeffrey Martin
Draft PR here: https://github.com/apache/flink/pull/9742 There might be some failing tests (still waiting on Travis), but I think the diff is small enough for you to evaluate the approach for acceptability. On Sun, Sep 22, 2019 at 9:10 PM Terry Wang wrote: > Hi Jeffrey, > > You are right and I u

Re: non-deserializable root cause in DeclineCheckpoint

2019-09-22 Thread Terry Wang
Hi Jeffrey, You are right and I understood what you have said after I just studied the class org.apache.flink.util.SerializedThrowable. I prefer the fixes no.2 you mentioned: CheckpointException should always capture its wrapped exception as a SerializedThrowable Looking forward to se

Re: non-deserializable root cause in DeclineCheckpoint

2019-09-22 Thread Jeffrey Martin
Hi Terry, KafkaException comes in through the job's dependencies (it's defined in the kafka-clients jar packed up in the fat job jar) and is on either the TM nor JM default classpath. The job running in the TM includes the job dependencies and so can throw a KafkaException but the JM can't deseria

Re: non-deserializable root cause in DeclineCheckpoint

2019-09-22 Thread Terry Wang
Hi, Jeffrey~ Thanks for your detailed explanation and I understood why job failed with flink 1.9. But the two fixes you mentioned may still not work well. As KafkaException can be serialized in TM for there is necessary jar in its classpath but not in JM, so maybe it’s impossible to check the

Re: non-deserializable root cause in DeclineCheckpoint

2019-09-22 Thread Jeffrey Martin
Thanks for suggestion, Terry. I've investigated a bit further. DeclineCheckpoint specifically checks for the possibility of an exception that the JM won't be able to deserialize (i.e. anything other than a Checkpoint exception). It just doesn't check for the possibility of a CheckpointException th

Re: non-deserializable root cause in DeclineCheckpoint

2019-09-22 Thread Terry Wang
Hi, Jeffrey~ I think two fixes you mentioned may not work in your case. This problem https://issues.apache.org/jira/browse/FLINK-14076 is caused by TM and JM jar package environment inconsistent or jar loaded behavior inconsistent in nature. M

non-deserializable root cause in DeclineCheckpoint

2019-09-21 Thread Jeffrey Martin
JIRA ticket: https://issues.apache.org/jira/browse/FLINK-14076 I'm on Flink v1.9 with the Kafka connector and a standalone JM. If FlinkKafkaProducer fails while checkpointing, it throws a KafkaException which gets wrapped in a CheckpointException which is sent to the JM as a DeclineCheckpoint. Ka

Re: [FLINK-14076] non-deserializable root cause in DeclineCheckpoint

2019-09-20 Thread Jeffrey Martin
To be clear -- I'm happy to make a PR for either option below. (Either is <10 lines diff.) It's just the contributor guidelines said to get consensus first and then only make a PR if I'm assigned to do the work. On Fri, Sep 20, 2019 at 12:23 PM Jeffrey Martin wrote: > (possible dupe; I wasn't su

[FLINK-14076] non-deserializable root cause in DeclineCheckpoint

2019-09-20 Thread Jeffrey Martin
(possible dupe; I wasn't subscribed before and the previous message didn't seem to go through) I'm on Flink v1.9 with the Kafka connector and a standalone JM. If FlinkKafkaProducer fails while checkpointing, it throws a KafkaException which gets wrapped in a CheckpointException which is sent to th