Hi Omnia, hi Claude,

Thanks for putting this KIP together. This is an important unresolved issue in Kafka, which I have witnessed several times in production.
Please see my questions below:

10. Given the goal is to prevent OOMs, do we also need to limit the number of KafkaPrincipals in use?

11. How would an operator know or decide to change the configuration for the number of layers (producer.id.quota.cache.layer.count), e.g. increasing from 4 to 5, and why? Do we need a new metric to indicate that such a change could be useful?

12. Is producer.id.quota.cache.cleanup.scheduler.interval.ms a guaranteed interval, or rather simply a delay between cleanups? How did you decide on the default value of 10ms?

13. Under "New ProducerIdQuotaManagerCache", the documentation for the constructor params of ProducerIDQuotaManagerCache does not match the constructor signature.

14. Under "New ProducerIdQuotaManagerCache":

    public boolean track(KafkaPrincipal principal, int producerIdRate, long pid)

How is producerIdRate used? The reference implementation Claude shared does not use it.
https://github.com/Claudenw/kafka/blob/49b6eb0fb5cfaf19b072fd87986072a683ab976c/storage/src/main/java/org/apache/kafka/storage/internals/log/ProducerIDQuotaManager.java

15. I could not find a description or definition of TimestampedBloomFilter; could we add that to the KIP?

16. LayeredBloomFilter will have a fixed size (right?), but some users (KafkaPrincipals) might only use a small number of PIDs. Is it worth having a dual strategy, where we simply keep a Set of PIDs until we reach a certain size at which it pays off to switch to the LayeredBloomFilter?

17. Under "Rejected Alternatives" > "4. Throttle INIT_PRODUCER_ID requests", the KIP states:

    a. INIT_PRODUCER_ID for idempotent producer request PIDs from random controller every time so if a client got throttled on one controller doesn't guarantee it will not go through on next controller causing OOM at the leader later.

Is the INIT_PRODUCER_ID request really sent to a "random controller"?
From a quick look at Sender.maybeSendAndPollTransactionalRequest, for an idempotent producer, targetNode is set to the broker with the fewest outstanding requests. Am I looking at the wrong place?

18. Under "Rejected Alternatives" > "4. Throttle INIT_PRODUCER_ID requests", the KIP states:

    This solution might look simple however throttling the INIT_PRODUCER_ID doesn't guarantee the OOM wouldn't happened as (...)
    b. The problem happened on the activation of the PID when it produce and not at the initialisation. Which means Kafka wouldn't have OOM problem if the producer got assigned PID but crashed before producing anything.

Point b. does not seem to support the claim above?

19. Under "Rejected Alternatives" > "4. Throttle INIT_PRODUCER_ID requests", the KIP states:

    c. Throttling producers that crash between initialisation and producing could slow them down when they recover/fix the problem that caused them to crash right after initialising PID.

Doesn't it depend on the back-off time or how quotas are enforced? I'm not sure this would necessarily be a problem.

20. If the allocation of PIDs for idempotent producers was centralized, or otherwise the targetNode for that request was predictable, would that make throttling INIT_PRODUCER_ID a viable solution?

Best,
--
Igor
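P.S. To make question 16 concrete, here is a rough sketch of the dual strategy I have in mind. Everything in it is illustrative, not from the KIP: the class name, the threshold, and the Bloom filter (a minimal BitSet-based stand-in, since the KIP's actual proposal uses LayeredBloomFilter/TimestampedBloomFilter). The idea is just that small principals pay only for a HashSet, and we migrate to the probabilistic structure once the exact set grows past a break-even point:

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch for question 16: track PIDs exactly in a Set until a
// threshold, then migrate to a Bloom filter. All names and sizes here are
// assumptions for illustration; the real cache would use the KIP's
// LayeredBloomFilter instead of this minimal BitSet stand-in.
public class PidTracker {
    private static final int THRESHOLD = 1000;   // assumed switch-over point
    private static final int NUM_BITS = 1 << 16; // assumed filter size
    private static final int NUM_HASHES = 3;     // assumed hash count

    private Set<Long> exact = new HashSet<>();
    private BitSet bloom = null; // null while the exact Set is in use

    /** Returns true if the pid was not previously seen (newly tracked). */
    public boolean track(long pid) {
        if (bloom == null) {
            boolean added = exact.add(pid);
            if (exact.size() > THRESHOLD) {
                migrate();
            }
            return added;
        }
        boolean seen = mightContain(pid);
        add(pid);
        return !seen; // may be a false "seen" once probabilistic
    }

    // Replay the exact Set into the filter, then drop the Set.
    private void migrate() {
        bloom = new BitSet(NUM_BITS);
        for (long pid : exact) {
            add(pid);
        }
        exact = null;
    }

    private void add(long pid) {
        for (int i = 0; i < NUM_HASHES; i++) {
            bloom.set(index(pid, i));
        }
    }

    private boolean mightContain(long pid) {
        for (int i = 0; i < NUM_HASHES; i++) {
            if (!bloom.get(index(pid, i))) {
                return false;
            }
        }
        return true;
    }

    // Simple seeded mix of the pid into a bit index.
    private static int index(long pid, int seed) {
        long h = pid * 0x9E3779B97F4A7C15L + seed;
        return (int) Math.floorMod(h ^ (h >>> 32), NUM_BITS);
    }
}
```

Below the threshold, track() is exact and doubles as a "first time seen" signal (loosely mirroring the boolean track(...) signature from the KIP); after migration it becomes probabilistic, which is the trade-off the dual strategy would have to accept.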