Rohit Bobade created KAFKA-13269:
------------------------------------

             Summary: Kafka Streams Aggregation data loss between instance 
restarts and rebalances
                 Key: KAFKA-13269
                 URL: https://issues.apache.org/jira/browse/KAFKA-13269
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 2.6.2
            Reporter: Rohit Bobade


Using Kafka Streams 2.6.2 and doing count based aggregation of messages. Also 
setting Processing Guarantee - EXACTLY_ONCE_BETA and 
NUM_STANDBY_REPLICAS_CONFIG = 1. Sending some messages and restarting instances 
in middle while processing to test fault tolerance. The output count is 
incorrect because of data loss while restoring state.

It looks like the streams task becomes active and starts processing even when 
the state is not fully restored but is within the acceptable recovery lag 
(default is 10000) This results in data loss
{quote}A stateful active task is assigned to an instance only when its state is 
within the configured acceptable.recovery.lag, if one exists
{quote}
[https://docs.confluent.io/platform/current/streams/developer-guide/running-app.html?_ga=2.33073014.912824567.1630441414-1598368976.1615841473#state-restoration-during-workload-rebalance]

[https://docs.confluent.io/platform/current/installation/configuration/streams-configs.html#streamsconfigs_acceptable.recovery.lag]

Setting acceptable.recovery.lag to 0 and re-running the chaos tests gives the 
correct result.

Related KIP: 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams#KIP441:SmoothScalingOutforKafkaStreams-Computingthemost-caught-upinstances]

Just want to get some thoughts on this use case from the Kafka team or if 
anyone has encountered similar issue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to