Hi Omnia, hi Claude,

Thanks for putting this KIP together. This is an important unresolved issue in Kafka, which I have witnessed several times in production.
Please see my questions below:

10. Given the goal is to prevent OOMs, do we also need to limit the number of KafkaPrincipals in use?

11. How would an operator know or decide to change the configuration for the number of layers (producer.id.quota.cache.layer.count), e.g. increasing from 4 to 5, and why? Do we need a new metric to indicate that such a change could be useful?

12. Is producer.id.quota.cache.cleanup.scheduler.interval.ms a guaranteed interval, or rather simply a delay between cleanups? How did you decide on the default value of 10ms?

13. Under "New ProducerIdQuotaManagerCache", the documentation for the constructor params of ProducerIDQuotaManagerCache does not match the constructor signature.

14. Under "New ProducerIdQuotaManagerCache":

    public boolean track(KafkaPrincipal principal, int producerIdRate, long pid)

How is producerIdRate used? The reference implementation Claude shared does not use it.
https://github.com/Claudenw/kafka/blob/49b6eb0fb5cfaf19b072fd87986072a683ab976c/storage/src/main/java/org/apache/kafka/storage/internals/log/ProducerIDQuotaManager.java

15. I could not find a description or definition of TimestampedBloomFilter; could we add that to the KIP?

16. LayeredBloomFilter will have a fixed size (right?), but some users (KafkaPrincipals) might only use a small number of PIDs. Is it worth having a dual strategy, where we simply keep a Set of PIDs until we reach a certain size at which it pays off to switch to the LayeredBloomFilter?

17. Under "Rejected Alternatives" > "4. Throttle INIT_PRODUCER_ID requests", the KIP states:

    a. INIT_PRODUCER_ID for idempotent producer request PIDs from random controller every time so if a client got throttled on one controller doesn't guarantee it will not go through on next controller causing OOM at the leader later.

Is the INIT_PRODUCER_ID request really sent to a "random controller"?
From a quick look at Sender.maybeSendAndPollTransactionalRequest, for an idempotent producer, targetNode is set to the broker with the fewest outstanding requests. Am I looking at the wrong place?

18. Under "Rejected Alternatives" > "4. Throttle INIT_PRODUCER_ID requests", the KIP states:

    This solution might look simple however throttling the INIT_PRODUCER_ID doesn't guarantee the OOM wouldn't happened as (...)
    b. The problem happened on the activation of the PID when it produce and not at the initialisation. Which means Kafka wouldn't have OOM problem if the producer got assigned PID but crashed before producing anything.

Point b. does not seem to support the claim above?

19. Under "Rejected Alternatives" > "4. Throttle INIT_PRODUCER_ID requests", the KIP states:

    c. Throttling producers that crash between initialisation and producing could slow them down when they recover/fix the problem that caused them to crash right after initialising PID.

Doesn't it depend on the back-off time or how quotas are enforced? I'm not sure this would necessarily be a problem.

20. If the allocation of PIDs for idempotent producers was centralized, or otherwise the targetNode for that request was predictable, would that make throttling INIT_PRODUCER_ID a viable solution?

Best,
--
Igor
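P.S. To make question 16 concrete, here is a rough sketch of the dual strategy I have in mind. Everything in it is illustrative, not from the KIP: the class name, the threshold, and the Bloom filter (a minimal BitSet-based stand-in, since the KIP's actual proposal uses LayeredBloomFilter/TimestampedBloomFilter). The idea is just that small principals pay only for a HashSet, and we migrate to the probabilistic structure once the exact set grows past a break-even point:

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch for question 16: track PIDs exactly in a Set until a
// threshold, then migrate to a Bloom filter. All names and sizes here are
// assumptions for illustration; the real cache would use the KIP's
// LayeredBloomFilter instead of this minimal BitSet stand-in.
public class PidTracker {
    private static final int THRESHOLD = 1000;   // assumed switch-over point
    private static final int NUM_BITS = 1 << 16; // assumed filter size
    private static final int NUM_HASHES = 3;     // assumed hash count

    private Set<Long> exact = new HashSet<>();
    private BitSet bloom = null; // null while the exact Set is in use

    /** Returns true if the pid was not previously seen (newly tracked). */
    public boolean track(long pid) {
        if (bloom == null) {
            boolean added = exact.add(pid);
            if (exact.size() > THRESHOLD) {
                migrate();
            }
            return added;
        }
        boolean seen = mightContain(pid);
        add(pid);
        return !seen; // may be a false "seen" once probabilistic
    }

    // Replay the exact Set into the filter, then drop the Set.
    private void migrate() {
        bloom = new BitSet(NUM_BITS);
        for (long pid : exact) {
            add(pid);
        }
        exact = null;
    }

    private void add(long pid) {
        for (int i = 0; i < NUM_HASHES; i++) {
            bloom.set(index(pid, i));
        }
    }

    private boolean mightContain(long pid) {
        for (int i = 0; i < NUM_HASHES; i++) {
            if (!bloom.get(index(pid, i))) {
                return false;
            }
        }
        return true;
    }

    // Simple seeded mix of the pid into a bit index.
    private static int index(long pid, int seed) {
        long h = pid * 0x9E3779B97F4A7C15L + seed;
        return (int) Math.floorMod(h ^ (h >>> 32), NUM_BITS);
    }
}
```

Below the threshold, track() is exact and doubles as a "first time seen" signal (loosely mirroring the boolean track(...) signature from the KIP); after migration it becomes probabilistic, which is the trade-off the dual strategy would have to accept.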