Oh wow that Harness looks cool, I'll have to take a look at that. Unfortunately the JobManager UI seems to just show this: [image: image.png]
Though it does seem that maybe the source function is where the failure is happening according to this? [image: image.png] Still investigating, but I do see a lot of these logs: 2021-05-28 14:25:09,199 WARN org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction [] - Transaction KafkaTransactionState [transactionalId=feedback-union -> functions -> Sink: bluesteel-kafka_egress-egress-dd0a6f77c8b5eccd4b7254cdfd577ff9-39, producerId=2062, epoch=2684] has been open for 55399128 ms. This is close to or even exceeding the transaction timeout of 900000 ms. Seems like it's restoring some old kafka transaction? Not sure. I like Arvid's idea of attaching a debugger, I'll definitely give that a try. On Fri, May 28, 2021 at 7:49 AM Arvid Heise <ar...@apache.org> wrote: > If logs are not helping, I think the remaining option is to attach a > debugger [1]. I'd probably add a breakpoint to > LegacySourceFunctionThread#run and see what happens. If the issue is in > recovery, you should add a breakpoint to StreamTask#beforeInvoke. > > [1] > https://cwiki.apache.org/confluence/display/FLINK/Remote+Debugging+of+Flink+Clusters > > On Fri, May 28, 2021 at 1:11 PM Igal Shilman <i...@ververica.com> wrote: > >> Hi Tim, >> Any additional logs from before are highly appreciated, this would help >> us to trace this issue. >> By the way, do you see something in the JobManager's UI? >> >> On Fri, May 28, 2021 at 9:06 AM Tzu-Li (Gordon) Tai <tzuli...@apache.org> >> wrote: >> >>> Hi Timothy, >>> >>> It would indeed be hard to figure this out without any stack traces. >>> >>> Have you tried changing to debug level logs? Maybe you can also try >>> using the StateFun Harness to restore and run your job in the IDE - in that >>> case you should be able to see which code exactly is throwing this >>> exception. >>> >>> Cheers, >>> Gordon >>> >>> On Fri, May 28, 2021 at 12:39 PM Timothy Bess <tdbga...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> Just checking to see if anyone has experienced this error. Might just >>>> be a Flink thing that's irrelevant to statefun, but my job keeps failing >>>> over and over with this message: >>>> >>>> 2021-05-28 03:51:13,001 INFO >>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer [] - >>>> Starting FlinkKafkaInternalProducer (10/10) to produce into default >>>> topic __stateful_functions_random_topic_lNVlkW9SkYrtZ1oK >>>> 2021-05-28 03:51:13,001 INFO >>>> org.apache.flink.streaming.connectors.kafka.internal. >>>> FlinkKafkaInternalProducer [] - Attempting to resume transaction >>>> feedback-union -> functions -> Sink: >>>> bluesteel-kafka_egress-egress-dd0a6f77c8b5eccd4b7254cdfd577ff9-45 with >>>> producerId 31 and epoch 3088 >>>> 2021-05-28 03:51:13,017 WARN org.apache.flink.runtime.taskmanager.Task >>>> [] - Source: lead-leads-ingress -> router (leads) (10/10) >>>> (ff51aacdb850c6196c61425b82718862) switched from RUNNING to FAILED. >>>> java.lang.NullPointerException: null >>>> >>>> The null pointer doesn't come with any stack traces or anything. It's >>>> really mystifying. Seems to just fail while restoring continuously. >>>> >>>> Thanks, >>>> >>>> Tim >>>> >>>