GitHub user shanthoosh reopened a pull request: https://github.com/apache/samza/pull/792
SAMZA-1638: Recreate SystemProducer on KafkaCheckpointManager.writeCheckpoint failures.

The retry loop in the existing `KafkaCheckpointManager` implementation reuses the same `SystemProducer` instance after an exception and does not recreate it. When an unrecoverable exception occurs within the `SystemProducer`, all subsequent produce-message invocations on that instance fail, which makes the entire retry loop in `KafkaCheckpointManager` pointless.

This patch consists of the following changes:

1. Recreate the `SystemProducer` instance on failure, and add a unit test to verify the behavior.
2. Minor code cleanup in `TestKafkaCheckpointManager` and `KafkaCheckpointManager`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shanthoosh/samza kafka_checkpoint_manager_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/792.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #792

----
commit 1f2fc878a4d17f70d014f6442c6b81bee0137dd7
Author: Shanthoosh Venkataraman <spvenkat@...>
Date:   2018-11-02T23:27:26Z

    Kafka checkpoint manager fix.

----
---
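
For readers skimming the thread, the retry behavior described in the pull request summary can be sketched roughly as follows. This is an illustrative sketch only, not the code in the patch: the `Producer` interface, the `writeWithRetry` method, and the `RetryWithFreshProducer` class are hypothetical stand-ins and do not mirror Samza's actual `SystemProducer` or `KafkaCheckpointManager` signatures. It assumes failures surface as unchecked exceptions.

```java
import java.util.function.Supplier;

public class RetryWithFreshProducer {

    /** Hypothetical stand-in for a producer; NOT Samza's SystemProducer interface. */
    interface Producer {
        void send(String message);
        void stop();
    }

    /**
     * Tries to send a message up to maxAttempts times. After every failed
     * attempt the current producer is stopped and, if attempts remain, a new
     * one is obtained from the factory, so a producer stuck in an
     * unrecoverable state is never reused.
     *
     * @return the number of attempts that were needed
     */
    static int writeWithRetry(Supplier<Producer> factory, String message, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Producer producer = factory.get();
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                producer.send(message);
                producer.stop(); // release the producer once the write succeeded
                return attempt;
            } catch (RuntimeException e) {
                lastFailure = e;
                producer.stop(); // discard the possibly broken instance
                if (attempt < maxAttempts) {
                    producer = factory.get(); // retry with a fresh instance
                }
            }
        }
        throw lastFailure; // every attempt failed
    }
}
```

Retrying on the original instance would make every attempt fail in the same way once that instance is broken; taking each retry's producer from a factory is what gives the loop a chance to succeed.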