[ https://issues.apache.org/jira/browse/KAFKA-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219963#comment-17219963 ]
A. Sophie Blee-Goldman commented on KAFKA-10633:
------------------------------------------------

The issue with directory contents being deleted but not the directories themselves sounds like KAFKA-10564 (also fixed in 2.7.0/2.6.1). I don't believe that particular bug has any real implications, other than being annoying/misleading in the logs – it should still delete everything in the task directory, including the checkpoint file, which is how the assignor determines which persistent state is/isn't on an instance.

I'm kind of surprised that you would still get a rebalance after redeploying like that when using static membership. Unless it takes longer than the static group membership timeout, I guess. To be honest, I'm not that familiar with the specifics of static membership in general – maybe [~bchen225242] can chime in here. But I suppose I would start by checking out the logs on the Streams side after it comes back up, and see if it's logged any reason for triggering a rebalance explicitly. (There are a few reasons to trigger a rebalance after a static member is bounced, for example if its hostname changed for IQ. If anything like that happened, it should be logged clearly.)

> Constant probing rebalances in Streams 2.6
> ------------------------------------------
>
>                 Key: KAFKA-10633
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10633
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.6.0
>            Reporter: Bradley Peterson
>            Priority: Major
>         Attachments: Discover 2020-10-21T23 34 03.867Z - 2020-10-21T23 44 46.409Z.csv
>
>
> We are seeing a few issues with the new rebalancing behavior in Streams 2.6.
> This ticket is for constant probing rebalances on one StreamThread, but I'll
> mention the other issues, as they may be related.
> First, when we redeploy the application we see tasks being moved, even though
> the task assignment was stable before redeploying.
> We would expect to see tasks assigned back to the same instances and no
> movement. The application is in EC2, with persistent EBS volumes, and we use
> static group membership to avoid rebalancing. To redeploy the app we
> terminate all EC2 instances. The new instances will reattach the EBS volumes
> and use the same group member id.
> After redeploying, we sometimes see the group leader go into a tight probing
> rebalance loop. This doesn't happen immediately, it could be several hours
> later. Because the redeploy caused task movement, we see expected probing
> rebalances every 10 minutes. But, then one thread will go into a tight loop
> logging messages like "Triggering the followup rebalance scheduled for
> 1603323868771 ms.", handling the partition assignment (which doesn't change),
> then "Requested to schedule probing rebalance for 1603323868771 ms." This
> repeats several times a second until the app is restarted again. I'll attach
> a log export from one such incident.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
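
For anyone wanting to check their own logs for the tight loop described in the report, here is a rough Python sketch. It relies only on the two message texts quoted in the description. The function names, the threshold, and the assumption that each log record arrives as one line of text are illustrative and not part of Kafka Streams. The idea is that a healthy probing rebalance schedules a new timestamp each time, so the same scheduled timestamp being triggered repeatedly suggests the loop.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Message texts copied from the report; the rest of each log line is not assumed.
TRIGGERED = re.compile(r"Triggering the followup rebalance scheduled for (\d+) ms\.")
REQUESTED = re.compile(r"Requested to schedule probing rebalance for (\d+) ms\.")


def find_rebalance_loops(lines, threshold=3):
    """Return {scheduled_ms: (triggered, requested)} for every scheduled
    timestamp that was triggered at least `threshold` times.

    `threshold` is an arbitrary illustrative cutoff, not a Kafka setting.
    """
    triggered = Counter()
    requested = Counter()
    for line in lines:
        match = TRIGGERED.search(line)
        if match:
            triggered[int(match.group(1))] += 1
        match = REQUESTED.search(line)
        if match:
            requested[int(match.group(1))] += 1
    return {
        ts: (count, requested[ts])
        for ts, count in triggered.items()
        if count >= threshold
    }


def ms_to_utc(ms):
    """Render an epoch-milliseconds value as an ISO 8601 UTC string."""
    moment = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    moment += timedelta(milliseconds=ms % 1000)
    return moment.isoformat(timespec="milliseconds")
```

Passing the lines of a log export to `find_rebalance_loops` and printing `ms_to_utc` for each returned key shows which scheduled rebalance is being retriggered. The timestamp quoted in the report, 1603323868771, converts to 2020-10-21T23:44:28.771 UTC, which falls inside the time window named in the attachment.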