Koelli Mungee created KAFKA-7022:
------------------------------------

             Summary: Setting segment.bytes for a topic too small can cause 
ReplicaFetcher thread crash and in turn an unhealthy cluster due to 
under-replicated partitions
                 Key: KAFKA-7022
                 URL: https://issues.apache.org/jira/browse/KAFKA-7022
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 1.0.1
            Reporter: Koelli Mungee


The topic configuration segment.bytes was changed to 14 using the alter 
command. This resulted in ReplicaFetcher threads dying with the following 
exception:


{code:java}
[2018-06-07 21:02:15,669] ERROR [ReplicaFetcher replicaId=7, leaderId=9, fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: Error processing data for partition test-11 offset 2362
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
        at scala.Option.foreach(Option.scala:257)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: kafka.common.KafkaException: Trying to roll a new log segment for topic partition ledger-entry-request-5-11 with start offset 2362 while it already exists.
        at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1349)
        at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1316)
        at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
        at kafka.log.Log.roll(Log.scala:1316)
        at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1303)
        at kafka.log.Log$$anonfun$append$2.apply(Log.scala:726)
        at kafka.log.Log$$anonfun$append$2.apply(Log.scala:640)
        at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
        at kafka.log.Log.append(Log.scala:640)
        at kafka.log.Log.appendAsFollower(Log.scala:623)
        at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
        at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
        at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:256)
        at kafka.cluster.Partition.appendRecordsToFollower(Partition.scala:559)
        at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:112)
        at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:43)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:183)
        ... 13 more
[2018-06-07 21:02:15,669] INFO [ReplicaFetcher replicaId=7, leaderId=9, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
{code}

To recover, the topic configuration must be changed back to a reasonable 
value, and each broker whose ReplicaFetcher threads died must be restarted, 
one at a time, so that the under-replicated partitions catch up again.

A value like 14 bytes is too small to store a message in the log segment. An ls 
-al of the topic partition directory would look something like:


{code:java}
-rw-r--r--. 1 root root 10M Jun 7 21:53 00000000000000002362.index 
-rw-r--r--. 1 root root 0 Jun 7 21:02 00000000000000002362.log 
-rw-r--r--. 1 root root 10M Jun 7 21:53 00000000000000002362.timeindex 
-rw-r--r--. 1 root root 4 Jun 7 21:53 leader-epoch-checkpoint
{code}
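
One possible reading of the listing and the exception above, as a toy model 
rather than Kafka's actual code: the class RollSketch and its roll rule are 
assumptions made for illustration. It assumes a roll is triggered when the 
active segment's size exceeds segment.bytes minus the size of the incoming 
batch, and that the new segment's base offset is the current log end offset. 
With segment.bytes=14, an empty segment at offset 2362 would then be asked to 
roll to offset 2362 again.

```java
import java.util.TreeMap;

/** Toy model of segment rolling; a sketch, not Kafka's implementation. */
public final class RollSketch {
    // Base offset of each segment mapped to the bytes written to it.
    private final TreeMap<Long, Integer> segments = new TreeMap<>();
    private final int segmentBytes;
    private long logEndOffset;

    public RollSketch(int segmentBytes, long startOffset) {
        this.segmentBytes = segmentBytes;
        this.logEndOffset = startOffset;
        segments.put(startOffset, 0); // empty active segment, like the 0-byte .log file
    }

    /** Appends one batch, rolling first if the assumed rule says the segment is full. */
    public void append(int batchBytes, int recordCount) {
        int activeSize = segments.lastEntry().getValue();
        // Assumed roll rule: the batch does not fit into what is left of the segment.
        if (activeSize > segmentBytes - batchBytes) {
            if (segments.containsKey(logEndOffset)) {
                throw new IllegalStateException(
                    "Trying to roll a new log segment with start offset "
                        + logEndOffset + " while it already exists.");
            }
            segments.put(logEndOffset, 0);
        }
        segments.merge(segments.lastKey(), batchBytes, Integer::sum);
        logEndOffset += recordCount;
    }

    public long logEndOffset() {
        return logEndOffset;
    }

    public static void main(String[] args) {
        RollSketch healthy = new RollSketch(1024, 2362L);
        healthy.append(100, 1); // fits, no roll needed
        System.out.println("log end offset: " + healthy.logEndOffset());

        RollSketch tooSmall = new RollSketch(14, 2362L);
        try {
            tooSmall.append(100, 1); // any batch is larger than 14 bytes
        } catch (IllegalStateException e) {
            System.out.println("append failed: " + e.getMessage());
        }
    }
}
```

Under these assumptions every append to the partition fails in the same way, 
which would match a fetcher thread that stops on its first fetch after the 
configuration change.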

It would be good to add a check that prevents this configuration from being 
set to such a small value.
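
A minimal sketch of what such a check could look like, assuming a fixed lower 
bound. The class SegmentBytesCheck and the constant ASSUMED_MIN_SEGMENT_BYTES 
are hypothetical names, and the 1 MiB floor is a placeholder chosen for 
illustration, not a value taken from Kafka.

```java
/** Hypothetical validator for a topic's segment.bytes setting. */
public final class SegmentBytesCheck {
    // Placeholder floor for illustration only; the right bound is a design decision.
    public static final int ASSUMED_MIN_SEGMENT_BYTES = 1024 * 1024;

    private SegmentBytesCheck() {}

    /** Returns the value unchanged if acceptable, otherwise rejects it. */
    public static int validate(int requestedSegmentBytes) {
        if (requestedSegmentBytes < ASSUMED_MIN_SEGMENT_BYTES) {
            throw new IllegalArgumentException(
                "segment.bytes=" + requestedSegmentBytes
                    + " is below the minimum of " + ASSUMED_MIN_SEGMENT_BYTES);
        }
        return requestedSegmentBytes;
    }

    public static void main(String[] args) {
        validate(1073741824); // a 1 GiB segment size passes the check
        try {
            validate(14); // the value from this report
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting the value when the configuration is altered, rather than at append 
time, would keep a bad setting from ever reaching the ReplicaFetcher threads.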



