I've reset the state and the job appears to be running smoothly again now.
My guess is that this problem was somehow related to my state becoming too
large (it got to around 20GB before the problem began). I would still like
to get to the bottom of what caused this, as resetting the job's state is not a
long-term solution.
Hi Scott & Stephan,
The problem has happened a couple more times since yesterday. It's very
strange, as my job was running fine for over a week before this started
happening. I find that if I restart the job (and restore from the last
checkpoint) it runs fine for a while (a couple of hours) before breaking again.
Is it possible that you have stalls in your topology?
Reasons could be:
- The data sink blocks or becomes slow for some periods (where are you
sending the data to?)
- If you are using large state and a state backend that only supports
synchronous checkpointing, there may be a delay introduced while each
checkpoint is taken, since processing pauses until the snapshot completes
(a sketch of one alternative follows below)
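In case a concrete example helps, here is a minimal sketch (not from the original messages) of switching a job to the RocksDB state backend and enabling checkpointing. Whether snapshots actually run asynchronously depends on the Flink version and backend settings; the checkpoint URI and interval below are placeholders:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LargeStateJobSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds (placeholder interval).
        env.enableCheckpointing(60000);

        // RocksDB keeps state out of the JVM heap and, depending on version and
        // settings, can snapshot asynchronously, so large state (e.g. ~20GB) is
        // less likely to stall processing while a checkpoint is taken.
        // The checkpoint URI is a placeholder.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

        // ... source (e.g. Kinesis), transformations and sink go here ...

        env.execute("large-state-job-sketch");
    }
}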
Hi Steffen & Josh,
For what it's worth, I've been using the Kinesis connector with very good
results on Flink 1.1.2 and 1.1.3. I updated the Flink Kinesis connector KCL
and AWS SDK dependencies to the following versions:
aws.sdk.version: 1.11.34
aws.kinesis-kcl.version: 1.7.0
My customizations a
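For anyone following along, here is a minimal, illustrative sketch of how the Kinesis consumer is typically wired up (independent of the customizations mentioned above). The stream name, region and credential values are placeholders, and the config constant names follow the 1.2-era connector, so they may differ in other versions:

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KinesisConsumerSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder region and credentials; a real job would usually use a
        // credentials provider rather than hard-coded keys.
        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "eu-west-1");
        consumerConfig.setProperty(ConsumerConfigConstants.AWS_ACCESS_KEY_ID, "ACCESS_KEY_ID");
        consumerConfig.setProperty(ConsumerConfigConstants.AWS_SECRET_ACCESS_KEY, "SECRET_ACCESS_KEY");
        consumerConfig.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        // Placeholder stream name; reads records as plain strings.
        DataStream<String> records = env.addSource(
                new FlinkKinesisConsumer<>("my-stream", new SimpleStringSchema(), consumerConfig));

        records.print();

        env.execute("kinesis-consumer-sketch");
    }
}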
Hi Gordon,
Thanks for the fast reply!
You're right about the expired iterator exception occurring just before
each spike. I can't see any signs of long GC on the task managers... CPU
has been <15% the whole time, even while the spikes were taking place, and I
can't see anything unusual in the task manager logs.
Hi Josh,
That warning message was added as part of FLINK-4514. It is logged whenever a
shard iterator is used more than 5 minutes after it was returned from Kinesis.
The only time spent between when a shard iterator is returned and when it is
used to fetch the next batch of records is on deserializing the fetched records.
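As a side note (not a suggestion from this thread), the consumer also exposes knobs that bound how much is fetched, and therefore deserialized, per getRecords call. A small sketch with illustrative values, again assuming the constant names from more recent connector versions:

import java.util.Properties;

import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class GetRecordsTuningSketch {

    public static void main(String[] args) {
        Properties consumerConfig = new Properties();

        // Fetch fewer records per getRecords call so each batch spends less time
        // in deserialization before the next call (value is illustrative).
        consumerConfig.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_MAX, "1000");

        // Pause between consecutive getRecords calls per shard, in milliseconds
        // (value is illustrative).
        consumerConfig.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_INTERVAL_MILLIS, "200");

        // Pass consumerConfig to the FlinkKinesisConsumer as in the earlier sketch.
    }
}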
Hey Gordon,
I've been using Flink 1.2-SNAPSHOT for the past week (with FLINK-4514) with
no problems, but yesterday the Kinesis consumer started behaving
strangely... My Kinesis data stream is fairly constant at around 1.5MB/sec;
however, the Flink Kinesis consumer started to stop consuming for periods of time.
Hi Steffen,
Turns out that FLINK-4514 just missed Flink 1.1.2 and wasn’t included in the
release (I’ll update the fix version in JIRA to 1.1.3, thanks for noticing
this!).
The Flink community is going to release 1.1.3 asap, which will include the fix.
If you don’t want to wait for the release, you can build Flink from the current
release-1.1 branch, which already includes the fix.