Yes, this is a Kafka-side issue. Since the affected Kafka versions are all below 1.1.0, ideally we should bump the Kafka minor version in flink-connector-kafka-0.10/0.11 once the fix is back-ported on the Kafka side. However, given that the PR has already been merged for 2 years, I am not sure that will ever happen.
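
To make the root cause concrete for anyone following the thread: the behavior difference comes down to assign() vs. subscribe() on the plain KafkaConsumer, as described in the quoted discussion below. A minimal sketch (this is not our connector code; the broker address, topic, and group id are placeholders):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class AssignVsSubscribe {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "my-group");                // placeholder
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            // subscribe(): group-managed assignment. The client keeps talking
            // to the group coordinator and re-discovers it if the broker
            // hosting it dies.
            KafkaConsumer<String, String> subscriber = new KafkaConsumer<>(props);
            subscriber.subscribe(Collections.singletonList("my-topic"));

            // assign(): manual assignment, which is what the Flink connector
            // uses. On clients before 1.1.0, a dead coordinator is never
            // re-discovered, so offset commits against the group can get stuck.
            KafkaConsumer<String, String> assigner = new KafkaConsumer<>(props);
            assigner.assign(
                Collections.singletonList(new TopicPartition("my-topic", 0)));

            subscriber.close();
            assigner.close();
        }
    }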
--
Rong

On Fri, Mar 13, 2020 at 6:43 AM Aljoscha Krettek <aljos...@apache.org> wrote:

> Thanks for the update!
>
> On 13.03.20 13:47, Rong Rong wrote:
> > 1. I think we have finally pinpointed what the root cause of this issue
> > is: when partitions are assigned manually (e.g. with the assign() API
> > instead of the subscribe() API), the client will not try to rediscover
> > the coordinator if it dies [1]. This no longer seems to be an issue
> > after Kafka 1.1.0. After cherry-picking the PR [2] back to the Kafka
> > 0.11.x branch and packaging it with our Flink application, we haven't
> > seen this issue re-occur so far.
>
> So the solution to this thread is: we don't do anything because it is a
> Kafka bug that was fixed?
>
> > 2. GROUP_OFFSETS is in fact the default startup mode if checkpointing
> > is not enabled - that's why I was a bit surprised that this problem was
> > reported so many times. To follow up on the question of "whether
> > resuming from GROUP_OFFSETS is useful": there are definitely use cases
> > where users don't want to use checkpointing (e.g. due to resource
> > constraints, storage cost considerations, etc.), but somehow still want
> > to avoid a certain amount of data loss. Most of our analytics use cases
> > fall into this category.
>
> Yes, this is what I assumed. I was not suggesting to remove the feature.
> We also just leave it as is, right?
>
> Best,
> Aljoscha
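
P.S. For readers who haven't used the startup mode being discussed: below is a minimal sketch of a job resuming from GROUP_OFFSETS without checkpointing. The connector version, topic, group id, and property values are placeholders, not our production setup.

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

    public class GroupOffsetsExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
            // Checkpointing is deliberately NOT enabled here; offsets are then
            // committed back to Kafka via the client's auto-commit mechanism.

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
            props.setProperty("group.id", "my-analytics-group");      // placeholder
            props.setProperty("enable.auto.commit", "true");
            props.setProperty("auto.commit.interval.ms", "5000");

            FlinkKafkaConsumer011<String> consumer =
                new FlinkKafkaConsumer011<>(
                    "my-topic", new SimpleStringSchema(), props);
            // GROUP_OFFSETS is the default startup mode, so this call is
            // redundant, but it makes the behavior explicit: on restart
            // (without checkpoints) the job resumes from the last offsets
            // committed to the consumer group.
            consumer.setStartFromGroupOffsets();

            env.addSource(consumer).print();
            env.execute("resume-from-group-offsets");
        }
    }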