Hello Kafka community,

I've encountered an issue with OAuth authentication in Kafka when running
on a system that goes to sleep/hibernates. I believe I've identified a flaw
in the token refresh mechanism that affects reliability in certain
environments.

When using OAuth authentication between brokers and controllers, the token
refresh mechanism fails after system sleep/hibernation, causing all
authentication to fail until the service is restarted.

I observed this on my Confluent Platform setup running on a MacBook:

   - System went to sleep at 18:19
   - OAuth token refresh was scheduled for 18:31:29
   - Tokens expired at 18:42
   - System woke up at 18:53
   - No refresh login attempt occurred after wakeup
   - All authentication failed with expired tokens

After reviewing the ExpiringCredentialRefreshingLogin class code, I can see
the issue stems from how the refresh thread sleeps until the next scheduled
refresh time:

log.info("[Principal={}]: Expiring credential re-login sleeping until: {}",
        principalLogText(), new Date(nextRefreshMs));
time.sleep(nextRefreshMs - nowMs);

The sleep duration is computed once, before the thread goes to sleep. When
the system is suspended, this thread's execution is suspended with it. Upon
waking, the thread appears to simply continue its remaining sleep without
re-checking the wall-clock time, so nothing detects that the planned refresh
window (or even the token expiry) was missed during the suspension.
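
To illustrate one possible mitigation, here is a minimal sketch that sleeps
in bounded chunks and re-reads the wall clock after each one, so a missed
deadline is noticed shortly after wakeup. This is not the actual
ExpiringCredentialRefreshingLogin code: the class WallClockAwareSleep, the
WallClock and Sleeper interfaces, and the maxChunkMs parameter are all
hypothetical names I made up for the example.

```java
// Sketch only: not the actual ExpiringCredentialRefreshingLogin code.
final class WallClockAwareSleep {

    /** Hypothetical wall-clock source, e.g. System::currentTimeMillis. */
    interface WallClock {
        long nowMs();
    }

    /** Hypothetical sleeper, e.g. Thread::sleep; injectable so tests need no real delay. */
    interface Sleeper {
        void sleep(long ms) throws InterruptedException;
    }

    /**
     * Blocks until the wall clock reaches deadlineMs, sleeping at most
     * maxChunkMs per call and re-reading the clock after every chunk.
     * Returns the number of sleep calls made.
     */
    static int sleepUntil(long deadlineMs, long maxChunkMs, WallClock clock, Sleeper sleeper)
            throws InterruptedException {
        if (maxChunkMs <= 0) {
            throw new IllegalArgumentException("maxChunkMs must be positive");
        }
        int chunks = 0;
        long remainingMs;
        // The remaining time is recomputed from the wall clock on every pass,
        // so a clock jump caused by system suspension ends the loop early.
        while ((remainingMs = deadlineMs - clock.nowMs()) > 0) {
            sleeper.sleep(Math.min(remainingMs, maxChunkMs));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) throws InterruptedException {
        long deadlineMs = System.currentTimeMillis() + 200;
        int chunks = sleepUntil(deadlineMs, 50, System::currentTimeMillis, Thread::sleep);
        long lateMs = System.currentTimeMillis() - deadlineMs;
        System.out.println("Woke after " + chunks + " chunk(s), " + lateMs + " ms past the deadline");
    }
}
```

With a chunk size of a few seconds, I would expect the refresh to fire within
roughly one chunk of the system waking up, at the cost of a few extra
wakeups. Whether this fits the existing refresh thread is an open question.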

I understand that hibernating a Kafka cluster isn't a common production
scenario, so this issue probably affects few production users. However, I
believe this weakness in the token refresh mechanism could be a problem
wherever process or system suspension can occur, such as development
environments or containerized setups.

Do you consider this behavior a bug that should be addressed? And would you
recommend creating a KIP for this issue?

I'm asking because while this might be a niche case for production, the
authentication failure is particularly frustrating in development
environments as it requires a full cluster restart to resolve.

Thanks for your help,

Adrien
