[ https://issues.apache.org/jira/browse/KAFKA-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251371#comment-16251371 ]
Ben Corlett commented on KAFKA-6194:
------------------------------------

In the past, topics have been deleted and recreated with the same name. Unfortunately this has resulted in "lingering" topics that had to be removed from zookeeper manually. These days we version the topics and create new ones with new names. It might still happen, but we do this very rarely.

In the last 24 hours we have had another 15 failures. I've sanitised (renamed the topics in) the log file from one of them and attached it to the issue.

{code}
[2017-11-14 03:33:30,238] ERROR Uncaught exception in scheduled task 'delete-file' (kafka.utils.KafkaScheduler)
kafka.common.KafkaStorageException: Delete of log 00000000000000000000.log.deleted failed.
    at kafka.log.LogSegment.delete(LogSegment.scala:505)
    at kafka.log.Log.deleteSeg$1(Log.scala:1490)
    at kafka.log.Log.$anonfun$asyncDeleteSegment$2(Log.scala:1492)
    at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
    at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
{code}

I had a brief look through the code and I can see what you mean about the exception only being thrown if the delete fails and the file still exists.

As you can see from the logs, we have very aggressive rotation/deletion/compaction of 'topic1'. Probably far more than is actually needed, but that's the devs for you (some illustrative settings are sketched below).

I wonder if we have a sequence of two deletes followed by a compaction finishing and a new .deleted file appearing, so:
1. delete the file (succeeds)
2. delete the file (fails)
3. compaction finishes and a new .deleted file is created
4. check whether the file exists... it does... throw the exception
A minimal sketch of this interleaving is included below.

Another thing: it always seems to be 00000000000000000000.log.deleted, on all the different servers.

Maybe we could enable additional logging (a possible log4j tweak is also sketched below).
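
To illustrate what I mean by "aggressive", topic-level overrides along these lines are the kind of thing in play. The values here are examples only, not our actual topic1 settings:

{code}
# Hypothetical topic-level overrides illustrating aggressive rotation/compaction.
# Example values only, not the real topic1 configuration.
cleanup.policy=compact,delete
# roll a new segment every 10 minutes
segment.ms=600000
# delete rolled segments after 1 hour
retention.ms=3600000
# compact almost continuously
min.cleanable.dirty.ratio=0.01
# drop tombstones quickly
delete.retention.ms=60000
{code}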
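
Here is a minimal, standalone sketch of the interleaving above. It is not the actual kafka.log.LogSegment code; the object name, helper method and file handling are made up purely to illustrate the delete-then-check-exists pattern that the exception message points at:

{code}
import java.io.{File, IOException}

// Minimal sketch (NOT the real kafka.log.LogSegment code) of a
// delete-then-check pattern that can fail spuriously when the file
// reappears between the two observations.
object DeleteRaceSketch {

  // Throws only when delete() reports failure AND the file still exists;
  // the two checks happen at different points in time.
  def deleteWithCheck(file: File): Unit = {
    val deleted = file.delete() // returns false if the file is already gone
    // <-- window: another thread could (re)create the file right here
    if (!deleted && file.exists())
      throw new IOException(s"Delete of log ${file.getName} failed.")
  }

  // main inlines the same check so the "window" can be simulated by hand.
  def main(args: Array[String]): Unit = {
    val f = new File("00000000000000000000.log.deleted") // hypothetical name
    f.delete() // make sure it does not exist to start with

    // step 2: delete() on an already-deleted file returns false
    val deleted = f.delete()
    // step 3: "compaction finishes" and a file with the same name reappears
    f.createNewFile()
    // step 4: the existence check now passes, so this would throw
    if (!deleted && f.exists())
      println(s"Would throw: Delete of log ${f.getName} failed.")

    f.delete() // clean up
  }
}
{code}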
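
For the additional logging, something along these lines in the broker's log4j.properties should surface more detail from the kafka.log classes (Log, LogSegment, LogManager) around segment deletion. The exact appender wiring depends on the existing config, so treat this as a starting point only:

{code}
# Bump the kafka.log package to DEBUG to capture more detail around
# segment deletion; revert to INFO once enough failures have been sampled.
log4j.logger.kafka.log=DEBUG
{code}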

> Server crash while deleting segments
> ------------------------------------
>
>                 Key: KAFKA-6194
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6194
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.0.0
>         Environment: kafka version: 1.0
> client versions: 0.8.2.1-0.10.2.1
> platform: aws (eu-west-1a)
> nodes: 36 x r4.xlarge
> disk storage: 2.5 tb per node (~73% usage per node)
> topics: 250
> number of partitions: 48k (approx)
> os: ubuntu 14.04
> jvm: Java(TM) SE Runtime Environment (build 1.8.0_131-b11), Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
>            Reporter: Ben Corlett
>              Labels: regression
>         Attachments: server.log.2017-11-14-03.gz
>
>
> We upgraded our R+D cluster to 1.0 in the hope that it would fix the deadlock
> from 0.11.0.*. Sadly our cluster has the memory leak issues with 1.0 most
> likely from https://issues.apache.org/jira/browse/KAFKA-6185. We are running
> one server on a patched version of 1.0 with the pull request from that.
> However today we have had two different servers fall over for non-heap
> related reasons.
> The exceptions in the kafka log are:
> {code}
> [2017-11-09 15:32:04,037] ERROR Error while deleting segments for xxxxxxxxxx-49 in dir /mnt/secure/kafka/datalog (kafka.server.LogDirFailureChannel)
> java.io.IOException: Delete of log 00000000000000000000.log.deleted failed.
>     at kafka.log.LogSegment.delete(LogSegment.scala:496)
>     at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
>     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>     at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
>     at kafka.log.Log.deleteSeg$1(Log.scala:1596)
>     at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
>     at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
>     at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> [2017-11-09 15:32:04,040] INFO [ReplicaManager broker=122] Stopping serving replicas in dir /mnt/secure/kafka/datalog (kafka.server.ReplicaManager)
> [2017-11-09 15:32:04,041] ERROR Uncaught exception in scheduled task 'delete-file' (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.KafkaStorageException: Error while deleting segments for xxxxxxxxxxxxxx-49 in dir /mnt/secure/kafka/datalog
> Caused by: java.io.IOException: Delete of log 00000000000000000000.log.deleted failed.
>     at kafka.log.LogSegment.delete(LogSegment.scala:496)
>     at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
>     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>     at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
>     at kafka.log.Log.deleteSeg$1(Log.scala:1596)
>     at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
>     at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
>     at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> .....
> [2017-11-09 15:32:05,341] ERROR Error while processing data for partition xxxxxxx-83 (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.KafkaStorageException: Replica 122 is in an offline log directory for partition xxxxxxx-83
> [2017-11-09 15:32:05,341] ERROR Error while processing data for partition xxxxxxx-89 (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.KafkaStorageException: Replica 122 is in an offline log directory for partition xxxxxxx-89
> [2017-11-09 15:32:05,341] ERROR Error while processing data for partition xxxxxxx-76 (kafka.server.ReplicaFetcherThread)
> .....
> [2017-11-09 15:32:05,613] WARN [ReplicaManager broker=122] While recording the replica LEO, the partition xxxxxxx-27 hasn't been created. (kafka.server.ReplicaManager)
> [2017-11-09 15:32:05,613] WARN [ReplicaManager broker=122] While recording the replica LEO, the partition xxxxxxxxx-79 hasn't been created. (kafka.server.ReplicaManager)
> [2017-11-09 15:32:05,622] FATAL Shutdown broker because all log dirs in /mnt/secure/kafka/datalog have failed (kafka.log.LogManager)
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)