[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

F Méthot updated KAFKA-13272:
-----------------------------
    Summary: KStream offset stuck after brokers outage when exactly_once 
enabled  (was: KStream offset stuck with exactly_once enabled after brokers 
outage)

> KStream offset stuck after brokers outage when exactly_once enabled
> -------------------------------------------------------------------
>
>                 Key: KAFKA-13272
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13272
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.8.0
>         Environment: Kafka running on Kubernetes
> centos
>            Reporter: F Méthot
>            Priority: Major
>
> Our KStream app's offsets stay stuck after an outage when exactly_once is enabled.
> We are running KStreams 2.8 against Kafka brokers 2.8, with 3 brokers.
> The commands topic has 10 partitions (replication 2, min-insync 2).
> The command-expiry-store-changelog topic has 10 partitions (replication 2,
> min-insync 2).
> The events topic has 10 partitions (replication 2, min-insync 2).
> The app runs with this topology:
> Topologies:
>  
> {code:java}
> Sub-topology: 0
>  Source: KSTREAM-SOURCE-0000000000 (topics: [commands])
>  --> KSTREAM-TRANSFORM-0000000001
>  Processor: KSTREAM-TRANSFORM-0000000001 (stores: [])
>  --> KSTREAM-TRANSFORM-0000000002
>  <-- KSTREAM-SOURCE-0000000000
>  Processor: KSTREAM-TRANSFORM-0000000002 (stores: [command-expiry-store])
>  --> KSTREAM-SINK-0000000003
>  <-- KSTREAM-TRANSFORM-0000000001
>  Sink: KSTREAM-SINK-0000000003 (topic: events)
>  <-- KSTREAM-TRANSFORM-0000000002
> {code}
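> For context, a minimal DSL sketch that would produce the node graph above; the
> transformer logic, key/value types and Serdes are assumptions, not our actual
> code:
> {code:java}
> import org.apache.kafka.common.serialization.Serdes;
> import org.apache.kafka.streams.KeyValue;
> import org.apache.kafka.streams.StreamsBuilder;
> import org.apache.kafka.streams.Topology;
> import org.apache.kafka.streams.kstream.Consumed;
> import org.apache.kafka.streams.kstream.Produced;
> import org.apache.kafka.streams.kstream.Transformer;
> import org.apache.kafka.streams.kstream.TransformerSupplier;
> import org.apache.kafka.streams.processor.ProcessorContext;
> import org.apache.kafka.streams.state.KeyValueStore;
> import org.apache.kafka.streams.state.Stores;
> 
> public class TopologySketch {
> 
>     // Stateless first step (KSTREAM-TRANSFORM-0000000001); pass-through here.
>     static class StatelessTransformer implements Transformer<String, String, KeyValue<String, String>> {
>         public void init(ProcessorContext context) { }
>         public KeyValue<String, String> transform(String key, String value) {
>             return KeyValue.pair(key, value);
>         }
>         public void close() { }
>     }
> 
>     // Stateful second step (KSTREAM-TRANSFORM-0000000002) backed by command-expiry-store.
>     static class ExpiryTransformer implements Transformer<String, String, KeyValue<String, String>> {
>         private KeyValueStore<String, String> store;
>         public void init(ProcessorContext context) {
>             store = (KeyValueStore<String, String>) context.getStateStore("command-expiry-store");
>         }
>         public KeyValue<String, String> transform(String key, String value) {
>             store.put(key, value); // remember the command for later expiry
>             return KeyValue.pair(key, value);
>         }
>         public void close() { }
>     }
> 
>     public static Topology build() {
>         StreamsBuilder builder = new StreamsBuilder();
>         // Streams backs this store with the command-expiry-store-changelog topic.
>         builder.addStateStore(Stores.keyValueStoreBuilder(
>             Stores.persistentKeyValueStore("command-expiry-store"),
>             Serdes.String(), Serdes.String()));
> 
>         TransformerSupplier<String, String, KeyValue<String, String>> stateless = StatelessTransformer::new;
>         TransformerSupplier<String, String, KeyValue<String, String>> expiry = ExpiryTransformer::new;
> 
>         builder.stream("commands", Consumed.with(Serdes.String(), Serdes.String()))
>             .transform(stateless)                      // KSTREAM-TRANSFORM-0000000001
>             .transform(expiry, "command-expiry-store") // KSTREAM-TRANSFORM-0000000002
>             .to("events", Produced.with(Serdes.String(), Serdes.String()));
>         return builder.build();
>     }
> }
> {code}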
> h3. Attempt 1 at reproducing this issue
> Our stream app runs with processing.guarantee set to *exactly_once*.
> After a test outage in which all 3 Kafka broker pods were deleted at the same
> time, the brokers restarted and initialized successfully.
> When the topology above was restarted, one of the tasks would never initialize
> fully; the restore phase kept outputting this message every few minutes:
>  
> {code:java}
> 2021-08-16 14:20:33,421 INFO stream-thread 
> [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
> Restoration in progress for 1 partitions. 
> {commands-processor-expiry-store-changelog-8: position=11775908, 
> end=11775911, totalRestored=2002076} 
> [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] 
> (org.apache.kafka.streams.processor.internals.StoreChangelogReader)
> {code}
> The task for partition 8 would never initialize, and no more data would be read
> from the source commands topic for that partition.
>  
> In an attempt to recover, we restarted the stream app with
> processing.guarantee set back to at_least_once; it then proceeded to read the
> changelog and restore partition 8 fully.
> But we noticed afterward, for the next hour until we rebuilt the system, that
> partition 8 of command-expiry-store-changelog was not being cleaned/compacted
> by the log cleaner, unlike the other partitions (this could be unrelated,
> because we have seen it before).
> So we resorted to deleting and recreating our command-expiry-store-changelog
> and events topics and regenerating them from the commands topic, reading from
> the beginning. Things went back to normal.
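> For reference, a hypothetical sketch of that delete/recreate step using the
> AdminClient; the broker address is an assumption, and the changelog topic is
> normally re-created by Streams itself on restart:
> {code:java}
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.NewTopic;
> import org.apache.kafka.common.config.TopicConfig;
> 
> import java.util.List;
> import java.util.Map;
> import java.util.Properties;
> 
> public class RebuildDerivedTopicsSketch {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "kafka:9092"); // assumption
>         try (Admin admin = Admin.create(props)) {
>             // Drop the derived topics while the stream app is stopped.
>             admin.deleteTopics(List.of("command-expiry-store-changelog", "events")).all().get();
> 
>             // Re-create events with the same layout (10 partitions, RF 2, min.insync.replicas 2);
>             // the changelog topic is re-created by Streams when the app restarts.
>             admin.createTopics(List.of(
>                 new NewTopic("events", 10, (short) 2)
>                     .configs(Map.of(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")))).all().get();
>         }
>     }
> }
> {code}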
> h3. Attempt 2 at reproducing this issue
> We force-deleted all 3 pods running Kafka.
> After that, one of the partitions could not be restored (as reported in the
> previous attempt).
> For that partition, we noticed these logs on the broker:
> {code:java}
> [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: 
> Couldn't find leader endpoint for partitions Set(__consumer_offsets-11, 
> command-expiry-store-changelog-9) while trying to send transaction markers 
> for commands-processor-0_9, these partitions are likely deleted already and 
> hence can be skipped 
> (kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
> Then:
> - we stopped the kstream app,
> - restarted the kafka brokers cleanly,
> - restarted the kstream app.
> These log messages showed up in the kstream app log:
>  
> {code:java}
> 2021-08-27 18:34:42,413 INFO [Consumer 
> clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer,
>  groupId=commands-processor] The following partitions still have unstable 
> offsets which are not cleared on the broker side: [commands-9], this could be 
> either transactional offsets waiting for completion, or normal offsets 
> waiting for replication after appending to local log 
> [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
>  
> {code}
> This would cause our processor to not consume from that specific source
> topic-partition.
> Deleting the downstream topics and replaying the data would NOT fix the issue
> (with either EXACTLY_ONCE or AT_LEAST_ONCE).
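> A hypothetical diagnostic sketch, using only the Admin API, to check whether the
> committed offset for the stuck partition (commands-9 in the log above) is
> advancing relative to the log-end offset; the broker address is an assumption:
> {code:java}
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
> import org.apache.kafka.clients.admin.OffsetSpec;
> import org.apache.kafka.clients.consumer.OffsetAndMetadata;
> import org.apache.kafka.common.TopicPartition;
> 
> import java.util.Map;
> import java.util.Properties;
> 
> public class StuckOffsetCheckSketch {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "kafka:9092"); // assumption
>         TopicPartition stuck = new TopicPartition("commands", 9);
>         try (Admin admin = Admin.create(props)) {
>             // Committed offset of the Streams app's consumer group for the partition.
>             OffsetAndMetadata committed = admin
>                 .listConsumerGroupOffsets("commands-processor")
>                 .partitionsToOffsetAndMetadata().get().get(stuck);
>             // Current log-end offset of the same partition.
>             ListOffsetsResultInfo end = admin
>                 .listOffsets(Map.of(stuck, OffsetSpec.latest()))
>                 .partitionResult(stuck).get();
>             // If the committed offset stays put while the end offset grows, the partition is stuck.
>             System.out.printf("committed=%s endOffset=%d%n",
>                 committed == null ? "none" : committed.offset(), end.offset());
>         }
>     }
> }
> {code}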
> Workaround found:
> We deleted the consumer group associated with the processor and restarted the
> kstream application; the application went on to process data normally. (We
> have resigned ourselves to using AT_LEAST_ONCE for now.)
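> A hypothetical sketch of that workaround using the AdminClient; the group id is
> the one from the logs above (the Streams application.id), the broker address is
> an assumption, and the group must be empty (app stopped) for the delete to succeed:
> {code:java}
> import org.apache.kafka.clients.admin.Admin;
> 
> import java.util.List;
> import java.util.Properties;
> 
> public class DeleteGroupSketch {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "kafka:9092"); // assumption
>         try (Admin admin = Admin.create(props)) {
>             // The Streams application.id doubles as its consumer group id.
>             admin.deleteConsumerGroups(List.of("commands-processor")).all().get();
>         }
>     }
> }
> {code}
> Deleting the group discards its committed offsets, so with
> auto.offset.reset=earliest the restarted app reprocesses from the beginning.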
> KStream config:
> StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
> StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
> StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
> StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
> ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"
> StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now
> AT_LEAST_ONCE)
> producer.delivery.timeout.ms=120000
> consumer.session.timeout.ms=30000
> consumer.heartbeat.interval.ms=10000
> consumer.max.poll.interval.ms=300000
> num.stream.threads=1
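> For completeness, a hypothetical sketch of how the settings above could be
> assembled into a Streams Properties object; the application id and bootstrap
> servers are assumptions:
> {code:java}
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.streams.StreamsConfig;
> 
> import java.util.Properties;
> 
> public class StreamsPropsSketch {
>     public static Properties build() {
>         Properties props = new Properties();
>         props.put(StreamsConfig.APPLICATION_ID_CONFIG, "commands-processor"); // assumption (group id from the logs)
>         props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");      // assumption
>         props.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000);
>         props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
>         props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);
>         props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24 * 1024 * 1024); // 24 MB
>         props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
>         props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE); // now AT_LEAST_ONCE
>         props.put(StreamsConfig.producerPrefix("delivery.timeout.ms"), 120000);
>         props.put(StreamsConfig.consumerPrefix("session.timeout.ms"), 30000);
>         props.put(StreamsConfig.consumerPrefix("heartbeat.interval.ms"), 10000);
>         props.put(StreamsConfig.consumerPrefix("max.poll.interval.ms"), 300000);
>         props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
>         return props;
>     }
> }
> {code}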
>  
> We will be doing more tests, and I will update the ticket with any findings.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
