vroyer opened a new issue, #24677: URL: https://github.com/apache/pulsar/issues/24677
### Search before reporting

- [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.

### Read release policy

- [x] I understand that [unsupported versions](https://pulsar.apache.org/contribute/release-policy/#supported-versions) don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

### User environment

We are running a Helm-deployed Pulsar cluster on an on-prem Kubernetes cluster (lunastreaming 4.1.3.18, https://github.com/datastax/pulsar/tree/ls31_4.18) with 3 bookies and 3 brokers.

### Issue Description

When a bookie is unavailable because of a k8s hardware issue (1 out of 3 bookies, with quorum=2), the pulsar-function tries to read some metadata from BookKeeper, and the unavailable bookie causes a very CPU-expensive read retry loop. As a result, the pulsar-function liveness health check fails and k8s kills the pulsar-function pod. Meanwhile, some Pulsar connector pods cannot start properly.

### Error messages

```text
```

### Reproducing the issue

This read retry loop with no backoff for an unavailable bookie seems to be a BookKeeper issue (BK version 4.16.7), because there is not even a BK read backoff setting to mitigate this kind of situation. It is useless to immediately retry reading from a bookie that fails with this error: "Cannot resolve bookieId glpdlskub016:3181, bookie does not exist or it is not running"

### Additional information

_No response_

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
