Hello Kafka community,

I've encountered an issue with OAuth authentication in Kafka when running on a system that goes to sleep or hibernates. I believe I've identified a flaw in the token refresh mechanism that affects reliability in certain environments.

When OAuth authentication is used between brokers and controllers, the token refresh mechanism fails after system sleep/hibernation, and all authentication then fails until the service is restarted.

I observed this on my Confluent Platform setup running on a MacBook:

- The OAuth token refresh was scheduled for 18:31:29
- The system went to sleep at 18:19
- The system woke up at 18:53, after the tokens had expired at 18:42
- No refresh login attempt occurred after wakeup
- All authentication failed with expired tokens

After reviewing the code of the ExpiringCredentialRefreshingLogin class, I can see that the issue stems from how the refresh thread sleeps until the next scheduled refresh time:

    log.info("[Principal={}]: Expiring credential re-login sleeping until: {}",
        principalLogText(), new Date(nextRefreshMs));
    time.sleep(nextRefreshMs - nowMs);

When the system goes to sleep, this thread's execution is suspended. Upon waking, the thread simply continues its sleep operation without any awareness that a significant amount of time may have passed. There is no mechanism to detect that the planned refresh window has been missed due to system suspension.

I understand that hibernating a Kafka cluster isn't a common production scenario, and that this issue might not affect many users in production. However, I believe this weakness in the token refresh mechanism could be problematic in scenarios such as development environments, containerized setups, or any other situation where process suspension might occur.

Do you consider this behavior a bug that should be addressed? And would you recommend creating a KIP for this issue?

I'm asking because, while this might be a niche case for production, the authentication failure is particularly frustrating in development environments, as it requires a full cluster restart to resolve.

Thanks for your help,
Adrien
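
P.S. To make the idea concrete, here is a minimal sketch of one possible mitigation: sleep in bounded slices and re-read the wall clock after each slice, so that a missed refresh time is noticed within one slice after wakeup. This is not Kafka's code. WallClockAwareSleep, sleepUntil, maxSliceMs and the clock/sleeper parameters are hypothetical names I made up for illustration, and the demo uses a fake clock to simulate the suspension. I assume that in the real class the clock and sleeper would map onto the `time` object seen in the snippet above.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical helper for illustration only; not part of Kafka.
class WallClockAwareSleep {

    /**
     * Waits until the wall clock reaches deadlineMs, sleeping in slices of at
     * most maxSliceMs (must be > 0) and re-reading the clock after every slice.
     * Returns how many milliseconds past the deadline the clock was on exit.
     */
    static long sleepUntil(long deadlineMs, long maxSliceMs,
                           LongSupplier clock, LongConsumer sleeper) {
        long nowMs = clock.getAsLong();
        while (nowMs < deadlineMs) {
            // Never sleep longer than one slice, so a clock jump caused by
            // system suspension is detected at the next check.
            sleeper.accept(Math.min(maxSliceMs, deadlineMs - nowMs));
            nowMs = clock.getAsLong();
        }
        return nowMs - deadlineMs;
    }

    public static void main(String[] args) {
        long[] fakeNowMs = {0L}; // simulated wall clock
        int[] slices = {0};      // number of sleep slices taken

        long overshootMs = sleepUntil(60_000L, 1_000L,
            () -> fakeNowMs[0],
            ms -> {
                slices[0]++;
                // The third slice simulates a suspension: the wall clock
                // advances by an extra 1,000,000 ms while the thread "sleeps".
                fakeNowMs[0] += ms + (slices[0] == 3 ? 1_000_000L : 0L);
            });

        System.out.println("slices=" + slices[0] + ", overshootMs=" + overshootMs);
    }
}
```

With a real clock, the caller would pass something like System::currentTimeMillis and a sleeper wrapping Thread.sleep. A non-zero overshoot tells the caller that the refresh time was missed and that it should attempt the re-login immediately. The trade-off is a few extra wakeups per refresh interval.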