Matthew Jarvie created KAFKA-7821:
-------------------------------------
Summary: Default cache size can lose session windows in
high-throughput deployment
Key: KAFKA-7821
URL: https://issues.apache.org/jira/browse/KAFKA-7821
Project: Kafka
Issue Type: Bug
Affects Versions: 2.1.0, 0.10.2.1
Reporter: Matthew Jarvie
We have observed that with a default cache size, a Streams aggregator will
sometimes fail to find existing, open session windows while handling records.
The effect is that it starts a new session and overwrites the old one and
events fail to aggregate together.
Our topology is fairly simple: We consume from a Kafka topic, group by keys,
aggregate, then produce to another topic. Our aggregator is configured to use a
window session strategy with an inactivity gap of 10 minutes and a retention
period of 10 minutes. The system is deployed in production and handles about
250k messages per thread per minute (4 threads per application). The cache size
is left default (10 MB).
We worked around the issue by enlarging the cache (cache.max.bytes.buffering
configuration parameter from 10 MB to 100MB) and no longer observe the issue at
all. While troubleshooting, we noticed that older sessions would be the ones
lost, so it seems like the cache is an LRU cache and is evicting windows before
their inactivity time is up.
This was originally observed in 10.2.1. We completed an upgrade to 2.1.0 and
still observed the issue.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)