Rohit Bobade created KAFKA-13269: ------------------------------------ Summary: Kafka Streams Aggregation data loss between instance restarts and rebalances Key: KAFKA-13269 URL: https://issues.apache.org/jira/browse/KAFKA-13269 Project: Kafka Issue Type: Bug Components: streams Affects Versions: 2.6.2 Reporter: Rohit Bobade
Using Kafka Streams 2.6.2 and doing count based aggregation of messages. Also setting Processing Guarantee - EXACTLY_ONCE_BETA and NUM_STANDBY_REPLICAS_CONFIG = 1. Sending some messages and restarting instances in middle while processing to test fault tolerance. The output count is incorrect because of data loss while restoring state. It looks like the streams task becomes active and starts processing even when the state is not fully restored but is within the acceptable recovery lag (default is 10000) This results in data loss {quote}A stateful active task is assigned to an instance only when its state is within the configured acceptable.recovery.lag, if one exists {quote} [https://docs.confluent.io/platform/current/streams/developer-guide/running-app.html?_ga=2.33073014.912824567.1630441414-1598368976.1615841473#state-restoration-during-workload-rebalance] [https://docs.confluent.io/platform/current/installation/configuration/streams-configs.html#streamsconfigs_acceptable.recovery.lag] Setting acceptable.recovery.lag to 0 and re-running the chaos tests gives the correct result. Related KIP: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams#KIP441:SmoothScalingOutforKafkaStreams-Computingthemost-caught-upinstances] Just want to get some thoughts on this use case from the Kafka team or if anyone has encountered similar issue -- This message was sent by Atlassian Jira (v8.3.4#803005)