This block:

    @EmbeddedKafka(topics = { "WordCounts", "WordsForNumbers", "OutputTopic" })

starts up an embedded Kafka in the test and creates the 3 topics (2 input and 1 output). By default it creates them with 2 partitions each, but changing to 1 partition didn't alter the endless-rebalancing behavior.
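For reference, the test is shaped roughly like this (a condensed sketch, not the exact code from the demo project -- the test name and the Awaitility-based wait are stand-ins for the real assertion and its 30-second timeout):

    import static org.awaitility.Awaitility.await;

    import java.time.Duration;
    import org.apache.kafka.streams.KafkaStreams;
    import org.junit.jupiter.api.Test;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.context.SpringBootTest;
    import org.springframework.kafka.config.StreamsBuilderFactoryBean;
    import org.springframework.kafka.test.context.EmbeddedKafka;

    @SpringBootTest
    @EmbeddedKafka(topics = { "WordCounts", "WordsForNumbers", "OutputTopic" })
    class StreamsConfigurationTest {

        // Spring's Kafka Streams support exposes the KafkaStreams
        // instance through this factory bean.
        @Autowired
        private StreamsBuilderFactoryBean factoryBean;

        @Test
        void shouldReachRunningState() {
            // This is the assertion that never passes: the application sits
            // in REBALANCING until the 30-second timeout expires.
            await().atMost(Duration.ofSeconds(30)).until(() ->
                    factoryBean.getKafkaStreams().state() == KafkaStreams.State.RUNNING);
        }
    }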
We also see the endless rebalancing behavior in a real Kafka cluster, using input and output topics that have already been created (and are readily consumed from and written to).
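Since the discussion below keeps coming back to the shape of the topology, here is roughly what the relevant part of the demo's StreamsConfiguration does (again a condensed sketch, not the exact code -- the window size, aggregation logic, and output mapping are illustrative):

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.*;

    Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // The second input topic, read as a table for the foreign-key join.
        KTable<String, String> wordsForNumbers = builder.table(
                "WordsForNumbers", Consumed.with(Serdes.String(), Serdes.String()));

        // (1) grouping, (2) windowing, and (3) aggregating, with explicit
        // serdes supplied via Grouped.with and Materialized.with rather than
        // relying on the configured defaults.
        KTable<Windowed<String>, Long> counts = builder
                .stream("WordCounts", Consumed.with(Serdes.String(), Serdes.String()))
                .groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
                .aggregate(() -> 0L,
                        (word, value, total) -> total + 1,
                        Materialized.with(Serdes.String(), Serdes.Long()));

        // (4) the foreign-key join (KIP-213): each aggregate value is used
        // to look up a record in the other table.
        counts.join(wordsForNumbers,
                        count -> Long.toString(count),
                        (count, word) -> word + ": " + count)
                .toStream()
                .map((windowedKey, value) -> KeyValue.pair(windowedKey.key(), value))
                .to("OutputTopic", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }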
On Wed, Oct 28, 2020 at 12:45 PM Sophie Blee-Goldman <sop...@confluent.io> wrote:

> Yeah there's definitely something weird going on (assuming this is the full
> logs over that time period). The last thing we see logged from the
> StreamThread is this message from around the start of the task assignment
> process:
>
> 2020-10-28 12:22:37.879 DEBUG 27226 --- [-StreamThread-1] o.a.k.s.p.i.StreamsPartitionAssignor : stream-thread [demo-application-81060bdc-c8cc-4350-85f8-d238267e264e-StreamThread-1-consumer] Constructed client metadata {81060bdc-c8cc-4350-85f8-d238267e264e=ClientMetadata{hostInfo=null, consumers=[demo-application-81060bdc-c8cc-4350-85f8-d238267e264e-StreamThread-1-consumer-976853d9-06ad-4515-abf3-2a7398c12006], state=[activeTasks: ([]) standbyTasks: ([]) assignedTasks: ([]) prevActiveTasks: ([]) prevStandbyTasks: ([]) prevAssignedTasks: ([]) prevOwnedPartitionsByConsumerId: ([]) capacity: 1]}} from the member subscriptions.
>
> which is at 12:22:37. Then there's nothing else from Streams until at least
> 12:25:00, where the logs end. Not sure what it could be doing inside the
> assignor for 2+ minutes without ever reaching another... how many
> partitions are on the input topics? Are you sure the input topics have been
> pre-created before starting the app, with the correct names, etc?
>
> On Wed, Oct 28, 2020 at 10:29 AM Alex Jablonski <ajablon...@thoughtworks.com> wrote:
>
> > Hi Sophie,
> >
> > Thanks for your questions! Responses inline below. Also, I realized I
> > linked to the gradle file, not the interesting bits of the example. This
> > <https://github.com/ajablonski/streams-issue-demo/blob/master/src/main/java/com/github/ajablonski/StreamsConfiguration.java>
> > is the configuration and this
> > <https://github.com/ajablonski/streams-issue-demo/blob/master/src/test/java/com/github/ajablonski/StreamsConfigurationTest.java>
> > is the test.
> >
> > On Tue, Oct 27, 2020 at 10:11 PM Sophie Blee-Goldman <sop...@confluent.io> wrote:
> >
> >> > We've been able to get the crucial factors that cause this behavior
> >> > down to a particular combination
> >>
> >> What do you mean by this -- that you only see this when all four of those
> >> operators are at play? Or do you see it with any of them?
> >
> > We see this when all four operators are in play. If you change the sample
> > streams configuration to not do that final foreign key join, or not use
> > custom serdes for example, I don't see the stuck-state issue (the
> > application transitions to running state just fine).
> >
> >> I guess the first thing to narrow down is whether it's actually
> >> rebalancing or just restoring within this time (the REBALANCING state is
> >> somewhat misleadingly named). If this is a completely new app then it's
> >> probably not restoring, but if this app had already been running and
> >> building up state before hitting this issue then that's probably the
> >> reason. It's not at all uncommon for restoration to take more than 30
> >> seconds.
> >
> > This is happening with the app in a completely new state -- in the test,
> > for example, there's no pre-loaded data when we're asserting that the app
> > should eventually get to RUNNING, and none of the internal topics exist.
> >
> >> If it really is rebalancing this entire time, then you need to look into
> >> the logs to figure out why. I don't see anything obviously wrong with
> >> your particular application, and even if there was it should never result
> >> in endless rebalances like this. How many instances of the application
> >> are you running?
> >
> > In our actual application, we have 3 instances, but in the tests in that
> > sample project, there's only 1.
> >
> > The logs that we're getting right before the application gets "stuck" are
> > below. The one that seems most concerning to my uninformed eyes is "Member
> > demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer-cabbd9ce-83a7-4691-8599-b2ffe77da282
> > in group demo-application has failed". I've attached some DEBUG level logs
> > too, though nothing was obvious to me that would better explain the
> > hanging behavior.
> >
> > 2020-10-28 12:11:19.823 INFO 27127 --- [-StreamThread-1] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer, groupId=demo-application] Discovered group coordinator localhost:50343 (id: 2147483647 rack: null)
> > 2020-10-28 12:11:19.825 INFO 27127 --- [-StreamThread-1] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer, groupId=demo-application] (Re-)joining group
> > 2020-10-28 12:11:19.842 WARN 27127 --- [-StreamThread-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer, groupId=demo-application] Error while fetching metadata with correlation id 7 : {demo-application-KTABLE-FK-JOIN-SUBSCRIPTION-RESPONSE-0000000018-topic=UNKNOWN_TOPIC_OR_PARTITION, demo-application-GroupName-repartition=UNKNOWN_TOPIC_OR_PARTITION, demo-application-KTABLE-FK-JOIN-SUBSCRIPTION-REGISTRATION-0000000010-topic=UNKNOWN_TOPIC_OR_PARTITION}
> > 2020-10-28 12:11:19.860 INFO 27127 --- [-StreamThread-1] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer, groupId=demo-application] Join group failed with org.apache.kafka.common.errors.MemberIdRequiredException: The group member needs to have a valid member id before actually entering a consumer group
> > 2020-10-28 12:11:19.861 INFO 27127 --- [-StreamThread-1] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer, groupId=demo-application] (Re-)joining group
> > 2020-10-28 12:11:19.873 INFO 27127 --- [quest-handler-3] k.coordinator.group.GroupCoordinator : [GroupCoordinator 0]: Preparing to rebalance group demo-application in state PreparingRebalance with old generation 0 (__consumer_offsets-2) (reason: Adding new member demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer-cabbd9ce-83a7-4691-8599-b2ffe77da282 with group instance id None)
> > 2020-10-28 12:11:19.882 INFO 27127 --- [cutor-Rebalance] k.coordinator.group.GroupCoordinator : [GroupCoordinator 0]: Stabilized group demo-application generation 1 (__consumer_offsets-2)
> > 2020-10-28 12:11:19.947 WARN 27127 --- [-StreamThread-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer, groupId=demo-application] Error while fetching metadata with correlation id 9 : {demo-application-KTABLE-FK-JOIN-SUBSCRIPTION-RESPONSE-0000000018-topic=UNKNOWN_TOPIC_OR_PARTITION, demo-application-GroupName-repartition=UNKNOWN_TOPIC_OR_PARTITION, demo-application-KTABLE-FK-JOIN-SUBSCRIPTION-REGISTRATION-0000000010-topic=UNKNOWN_TOPIC_OR_PARTITION}
> > 2020-10-28 12:11:23.462 INFO 27127 --- [er-event-thread] kafka.controller.KafkaController : [Controller id=0] Processing automatic preferred replica leader election
> > 2020-10-28 12:11:29.887 INFO 27127 --- [cutor-Heartbeat] k.coordinator.group.GroupCoordinator : [GroupCoordinator 0]: Member demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer-cabbd9ce-83a7-4691-8599-b2ffe77da282 in group demo-application has failed, removing it from the group
> > 2020-10-28 12:11:29.887 INFO 27127 --- [cutor-Heartbeat] k.coordinator.group.GroupCoordinator : [GroupCoordinator 0]: Preparing to rebalance group demo-application in state PreparingRebalance with old generation 1 (__consumer_offsets-2) (reason: removing member demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer-cabbd9ce-83a7-4691-8599-b2ffe77da282 on heartbeat expiration)
> > 2020-10-28 12:11:29.888 INFO 27127 --- [cutor-Heartbeat] k.coordinator.group.GroupCoordinator : [GroupCoordinator 0]: Group demo-application with generation 2 is now empty (__consumer_offsets-2)
> >
> >> Cheers,
> >> Sophie
> >>
> >> On Thu, Oct 15, 2020 at 10:01 PM Alex Jablonski <ajablon...@thoughtworks.com> wrote:
> >>
> >> > Hey there!
> >> >
> >> > My team and I have run across a bit of a jam in our application where,
> >> > given a particular setup, our Kafka Streams application never seems to
> >> > start successfully, instead just getting stuck in the REBALANCING
> >> > state. We've been able to get the crucial factors that cause this
> >> > behavior down to a particular combination of (1) grouping, (2)
> >> > windowing, (3) aggregating, and (4) foreign-key joining, with some of
> >> > those steps specifying Serdes besides the default.
> >> >
> >> > It's probably more useful to see a minimal example, so there's one here
> >> > <https://github.com/ajablonski/streams-issue-demo/blob/master/build.gradle>.
> >> > The underlying Kafka Streams version is 2.5.1. The first test should
> >> > show the application eventually transition to running state, but it
> >> > doesn't within the 30 second timeout I've set. Interestingly, getting
> >> > rid of the 'Grouped.with' argument to the 'groupBy' function and the
> >> > 'Materialized.with' in 'aggregate' in the 'StreamsConfiguration' lets
> >> > the application transition to "RUNNING", though without the correct
> >> > Serdes that's not too valuable.
> >> >
> >> > There might be a cleaner way to organize the particular flow in the toy
> >> > example, but is there something fundamentally wrong with the approach
> >> > laid out in that application that would cause Streams to be stuck in
> >> > REBALANCING?
> >> > I'd appreciate any advice folks could give!
> >> >
> >> > Thanks!
> >> > Alex Jablonski