Koelli Mungee created KAFKA-7022:
------------------------------------

             Summary: Setting segment.bytes for a topic too small can cause ReplicaFetcher thread crash and in turn an unhealthy cluster due to under-replicated partitions
                 Key: KAFKA-7022
                 URL: https://issues.apache.org/jira/browse/KAFKA-7022
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 1.0.1
            Reporter: Koelli Mungee
The topic configuration segment.bytes was changed to 14 using the alter command. This caused the ReplicaFetcher threads to die with the following exception:

{code:java}
[2018-06-07 21:02:15,669] ERROR [ReplicaFetcher replicaId=7, leaderId=9, fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: Error processing data for partition test-11 offset 2362
	at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
	at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
	at scala.Option.foreach(Option.scala:257)
	at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
	at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
	at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
	at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
	at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
	at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
	at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
	at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: kafka.common.KafkaException: Trying to roll a new log segment for topic partition ledger-entry-request-5-11 with start offset 2362 while it already exists.
	at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1349)
	at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1316)
	at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
	at kafka.log.Log.roll(Log.scala:1316)
	at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1303)
	at kafka.log.Log$$anonfun$append$2.apply(Log.scala:726)
	at kafka.log.Log$$anonfun$append$2.apply(Log.scala:640)
	at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
	at kafka.log.Log.append(Log.scala:640)
	at kafka.log.Log.appendAsFollower(Log.scala:623)
	at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
	at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
	at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
	at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:256)
	at kafka.cluster.Partition.appendRecordsToFollower(Partition.scala:559)
	at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:112)
	at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:43)
	at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:183)
	... 13 more
[2018-06-07 21:02:15,669] INFO [ReplicaFetcher replicaId=7, leaderId=9, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
{code}

To fix the issue, the topic configuration must be changed back to a reasonable value, and each broker whose ReplicaFetcher threads died must be restarted, one at a time, to recover the under-replicated partitions.
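For reference, the change and the recovery can both be done with the kafka-configs tool; the ZooKeeper connect string and topic name below are placeholders, not values from this cluster:

{code:bash}
# Reproduces the problem state (localhost:2181 and topic "test" are placeholders):
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name test \
  --add-config segment.bytes=14

# Recovery step 1: restore a reasonable segment size (1 GiB, the broker default):
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name test \
  --add-config segment.bytes=1073741824

# Recovery step 2: roll-restart each broker whose ReplicaFetcher threads died.
{code}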
A value like 14 bytes is too small to hold even a single message in the log segment. An ls -al of the topic-partition directory looks something like:

{code:java}
-rw-r--r--. 1 root root 10M Jun 7 21:53 00000000000000002362.index
-rw-r--r--. 1 root root   0 Jun 7 21:02 00000000000000002362.log
-rw-r--r--. 1 root root 10M Jun 7 21:53 00000000000000002362.timeindex
-rw-r--r--. 1 root root   4 Jun 7 21:53 leader-epoch-checkpoint
{code}

It would be good to add a check that prevents this configuration from being set to such a small value.
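A minimal sketch of what such a guard could look like, using the range validator that Kafka's ConfigDef framework already provides. This is illustrative only: the property name and default are real, but the 1 MiB floor, class name, and wiring are assumptions, not the project's actual code:

{code:java}
import org.apache.kafka.common.config.ConfigDef;
import static org.apache.kafka.common.config.ConfigDef.Range.atLeast;

// Hypothetical sketch of a minimum-value check for segment.bytes.
public class SegmentBytesGuard {

    // Illustrative floor, large enough to hold real record batches;
    // the exact minimum is an assumption, not taken from the codebase.
    static final int MIN_SEGMENT_BYTES = 1024 * 1024;

    static final ConfigDef CONFIG = new ConfigDef()
            .define("segment.bytes",
                    ConfigDef.Type.INT,
                    1024 * 1024 * 1024,          // default: 1 GiB, matching log.segment.bytes
                    atLeast(MIN_SEGMENT_BYTES),  // rejects values like 14 at alter time
                    ConfigDef.Importance.HIGH,
                    "The hard maximum size of a single log segment file");
}
{code}

With a validator like this in place, parsing a config map containing segment.bytes=14 throws a ConfigException when the alter command is issued, instead of the broker accepting the value and the ReplicaFetcher threads crashing later.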