Hi,
Sorry for getting back to this so late, but I think the underlying problem is
that S3 does not behave the way the BucketingSink expects. See this message by
Stephan on the topic:
https://lists.apache.org/thread.html/34b8ede3affb965c7b5ec1e404918b39c282c258809dfd7a6c257a61@%3Cuser.flink.apach
Hi Aljoscha,
I've tried both. When we restart from a manually created savepoint we see
"restored from savepoint" in the Flink Dashboard. If we restart from
externalized checkpoint we see "restored from checkpoint". In both
scenarios we lose data in S3.
Cheers,
Fran
On 20 July 2017 at 17:54, Aljoscha wrote:
You said you cancel and restart the job. How do you then restart the job? From
a savepoint or externalised checkpoint? Do you also see missing data when using
an externalised checkpoint or a savepoint?
Best,
Aljoscha
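
For reference, a minimal sketch, assuming the Flink 1.3 CheckpointConfig API, of
how externalized checkpoints are enabled so that a cancelled job can later be
restored with bin/flink run -s <path-to-checkpoint-metadata>; the interval and
cleanup mode below are example values only:

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExternalizedCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator state (including the Kafka offsets) every 60 seconds.
        env.enableCheckpointing(60_000);

        // Retain the latest checkpoint when the job is cancelled, so it can be
        // restored later with: bin/flink run -s <checkpoint-metadata-path> job.jar
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... build the actual pipeline here and call env.execute() ...
    }
}

A savepoint triggered with bin/flink savepoint <jobId> can be restored through
the same -s flag.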
> On 20. Jul 2017, at 16:15, Francisco Blaya wrote:
Forgot to add that when a job gets cancelled via the UI (this is not the
case when the YARN session is killed) a part file ending in ".pending" does
appear in S3, but it never seems to be promoted to a finished file upon
restart of the job.
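
For context, a sketch of how a BucketingSink with a DateTimeBucketer and a
5-minute inactivity threshold (as described further down in this thread) might
be wired up; the S3 path, intervals and the stand-in source are placeholders,
and the ".pending" suffix shown is simply the sink's default:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class BucketingSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        // Stand-in source; in the real job this would be the Kafka consumer.
        DataStream<String> events = env.socketTextStream("localhost", 9999);

        BucketingSink<String> sink = new BucketingSink<>("s3://my-bucket/flink-output");
        sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HH"));
        sink.setInactiveBucketThreshold(5 * 60 * 1000L);  // close buckets idle for 5 minutes
        sink.setInactiveBucketCheckInterval(60 * 1000L);  // how often inactivity is checked
        sink.setPendingSuffix(".pending");                // the default, shown for clarity

        events.addSink(sink);
        env.execute("kafka-to-s3-example");
    }
}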
On 20 July 2017 at 11:41, Francisco Blaya wrote:
Hi,
Thanks for your answers.
@Fabian. I can see that Flink's consumer commits the offsets to Kafka, no
problem there. However I'd expect everything that gets read from Kafka to
appear in S3 at some point, even if the job gets stopped/killed before
flushing and then restarted. And that's what is not happening.
Hi,
What Fabian mentioned is true. The Flink Kafka Consumer's exactly-once guarantee
relies on the offsets being checkpointed as Flink state, and doesn't rely on the
committed offsets in Kafka.

> What we found is that Flink acks Kafka immediately before even writing to S3.

What you mean by ack here is the offsets committed back to Kafka, which, as
mentioned above, the exactly-once guarantee does not rely on.
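
To make that distinction concrete, a small sketch, assuming the 0.10 Kafka
connector and placeholder topic/broker settings: the offsets that matter for a
restore are the ones snapshotted as part of checkpoints, while
setCommitOffsetsOnCheckpoints only controls the offsets committed back to Kafka
(useful for monitoring, not for recovery):

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaOffsetsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The offsets used on restore are the ones snapshotted here, as part of
        // the consumer's checkpointed state.
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "my-group");                // placeholder

        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), props);

        // Commit offsets back to Kafka when a checkpoint completes. This is only
        // bookkeeping visible to Kafka tools; Flink does not read it back on restore.
        consumer.setCommitOffsetsOnCheckpoints(true);

        DataStream<String> events = env.addSource(consumer);
        events.print();
        env.execute("kafka-offsets-example");
    }
}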
Hi Fran,
does the DateTimeBucketer act like a memory buffer that isn't managed by Flink's
state? If so, then I think the problem is not about Kafka, but about the
DateTimeBucketer. Flink won't take a snapshot of the DateTimeBucketer's buffer
if it isn't kept in any state.
Best,
Sihua Zhou
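
To illustrate Sihua's point with a generic pattern (this is not the
BucketingSink's actual implementation): a sink whose in-memory buffer is
registered as operator state via CheckpointedFunction is included in
checkpoints/savepoints and handed back on restore; a buffer that is not
registered like this is simply gone after a failure or restart.

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Generic illustration only: buffer elements in memory, but register the buffer
// as operator state so Flink snapshots it and restores it after a failure.
public class BufferingSink extends RichSinkFunction<String> implements CheckpointedFunction {

    private final List<String> buffer = new ArrayList<>();
    private transient ListState<String> checkpointedBuffer;

    @Override
    public void invoke(String value) throws Exception {
        buffer.add(value);
        // A real sink would flush the buffer to the external system at some point.
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Called on every checkpoint: copy the in-memory buffer into Flink state.
        checkpointedBuffer.clear();
        for (String element : buffer) {
            checkpointedBuffer.add(element);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedBuffer = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("buffered-elements", String.class));

        if (context.isRestored()) {
            // On restore, previously buffered (un-flushed) elements come back here.
            for (String element : checkpointedBuffer.get()) {
                buffer.add(element);
            }
        }
    }
}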
Hi Fran,
did you observe actual data loss due to the problem you are describing or
are you discussing a possible issue based on your observations?
AFAIK, Flink's Kafka consumer keeps track of the offsets itself and
includes these in the checkpoints. In case of a recovery, it does not rely
on the offsets committed to Kafka.
Hi,
We have a Flink job running on AWS EMR sourcing a Kafka topic and
persisting the events to S3 through a DateTimeBucketer. We configured the
bucketer to flush to S3 with an inactivity period of 5 mins. The rate at
which events are written to Kafka in the first place is very low so it is
easy for