Coming back to this issue: it looks like it was a result of the CentOS 7 systemd cleanup task on /tmp:
/usr/lib/tmpfiles.d/tmp.conf

# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.

# See tmpfiles.d(5) for details

# Clear tmp directories separately, to make them easier to override
v /tmp 1777 root root 10d
v /var/tmp 1777 root root 30d

# Exclude namespace mountpoints created with PrivateTmp=yes
x /tmp/systemd-private-%b-*
X /tmp/systemd-private-%b-*/tmp
x /var/tmp/systemd-private-%b-*
X /var/tmp/systemd-private-%b-*/tmp
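That "v /tmp 1777 root root 10d" line ages out anything under /tmp that hasn't been touched in 10 days, which takes /tmp/kafka-logs (Kafka's default log.dirs) with it. For anyone wanting to guard against a repeat, a rough sketch of two options; the drop-in file name and the /var/lib path below are only illustrative, not something from this thread:

  # /etc/tmpfiles.d/kafka.conf
  # Tell systemd-tmpfiles to skip Kafka's data directory during cleanup
  # ('x' excludes the path and its contents from age-based removal)
  x /tmp/kafka-logs

  # server.properties
  # Better still, keep the Kafka data somewhere that is never auto-cleaned
  log.dirs=/var/lib/kafka-logs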
Cheers!

On Thu, May 26, 2016 at 9:27 AM, cs user <acldstk...@gmail.com> wrote:

> Hi Ben,
>
> Thanks for responding. I can't imagine what would have cleaned temp up at
> that time. I don't think we have anything in place to do that, and it also
> appears to have happened to both machines at the same time.
>
> It also appears that the other topics were not affected; there were still
> other files present in temp.
>
> Thanks!
>
> On Thu, May 26, 2016 at 9:19 AM, Ben Davison <ben.davi...@7digital.com> wrote:
>
>> Possibly tmp got cleaned up?
>>
>> Seems like one of the log files was deleted while a producer was writing
>> messages to it:
>>
>> On Thursday, 26 May 2016, cs user <acldstk...@gmail.com> wrote:
>>
>> > Hi All,
>> >
>> > We are running Kafka version 0.9.0.1. At the time the brokers crashed
>> > yesterday we were running a 2 node cluster; this has since been
>> > increased to 3.
>> >
>> > We are not specifying a broker id and are relying on Kafka generating one.
>> >
>> > After the brokers crashed (at exactly the same time) we left Kafka
>> > stopped for a while. After Kafka was started back up, the broker ids on
>> > both servers were incremented: they were 1001/1002 and they flipped to
>> > 1003/1004. This seemed to cause some problems, as partitions were
>> > assigned to broker ids which the cluster believed had disappeared and so
>> > were not recoverable.
>> >
>> > We noticed that the broker ids are actually stored in:
>> >
>> > /tmp/kafka-logs/meta.properties
>> >
>> > So we set these back to what they were and restarted. Is there a reason
>> > why these would change?
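For context, that auto-generated id is written to meta.properties inside the log directory the first time the broker starts, so wiping /tmp/kafka-logs makes the broker mint a fresh id on its next start. Roughly what that file holds (the values here are illustrative), plus the server.properties setting that pins the id instead of relying on generation:

  # /tmp/kafka-logs/meta.properties (written by the broker itself)
  version=0
  broker.id=1001

  # server.properties
  # An explicitly configured broker.id takes precedence over auto-generation
  broker.id=1001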
>> >
>> > Below are the error logs from each server:
>> >
>> > Server 1
>> >
>> > [2016-05-25 09:05:52,827] INFO [ReplicaFetcherManager on broker 1002] Removed fetcher for partitions [Topic1Heartbeat,1] (kafka.server.ReplicaFetcherManager)
>> > [2016-05-25 09:05:52,831] INFO Completed load of log Topic1Heartbeat-1 with log end offset 0 (kafka.log.Log)
>> > [2016-05-25 09:05:52,831] INFO Created log for partition [Topic1Heartbeat,1] in /tmp/kafka-logs with properties {compression.type -> producer, file.delete.delay.ms -> 60000, max.message.bytes -> 1000012, min.insync.replicas -> 1, segment.jitter.ms -> 0, preallocate -> false, min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096, unclean.leader.election.enable -> true, retention.bytes -> -1, delete.retention.ms -> 86400000, cleanup.policy -> delete, flush.ms -> 9223372036854775807, segment.ms -> 604800000, segment.bytes -> 1073741824, retention.ms -> 604800000, segment.index.bytes -> 10485760, flush.messages -> 9223372036854775807}. (kafka.log.LogManager)
>> > [2016-05-25 09:05:52,831] INFO Partition [Topic1Heartbeat,1] on broker 1002: No checkpointed highwatermark is found for partition [Topic1Heartbeat,1] (kafka.cluster.Partition)
>> > [2016-05-25 09:14:12,189] INFO [GroupCoordinator 1002]: Preparing to restabilize group Topic1 with old generation 0 (kafka.coordinator.GroupCoordinator)
>> > [2016-05-25 09:14:12,190] INFO [GroupCoordinator 1002]: Stabilized group Topic1 generation 1 (kafka.coordinator.GroupCoordinator)
>> > [2016-05-25 09:14:12,195] INFO [GroupCoordinator 1002]: Assignment received from leader for group Topic1 for generation 1 (kafka.coordinator.GroupCoordinator)
>> > [2016-05-25 09:14:12,749] FATAL [Replica Manager on Broker 1002]: Halting due to unrecoverable I/O error while handling produce request: (kafka.server.ReplicaManager)
>> > kafka.common.KafkaStorageException: I/O exception in append to log '__consumer_offsets-0'
>> >     at kafka.log.Log.append(Log.scala:318)
>> >     at kafka.cluster.Partition$$anonfun$9.apply(Partition.scala:442)
>> >     at kafka.cluster.Partition$$anonfun$9.apply(Partition.scala:428)
>> >     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>> >     at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:268)
>> >     at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:428)
>> >     at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:401)
>> >     at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:386)
>> >     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>> >     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>> >     at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
>> >     at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>> >     at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>> >     at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:386)
>> >     at kafka.server.ReplicaManager.appendMessages(ReplicaManager.scala:322)
>> >     at kafka.coordinator.GroupMetadataManager.store(GroupMetadataManager.scala:228)
>> >     at kafka.coordinator.GroupCoordinator$$anonfun$handleCommitOffsets$9.apply(GroupCoordinator.scala:429)
>> >     at kafka.coordinator.GroupCoordinator$$anonfun$handleCommitOffsets$9.apply(GroupCoordinator.scala:429)
>> >     at scala.Option.foreach(Option.scala:257)
>> >     at kafka.coordinator.GroupCoordinator.handleCommitOffsets(GroupCoordinator.scala:429)
>> >     at kafka.server.KafkaApis.handleOffsetCommitRequest(KafkaApis.scala:280)
>> >     at kafka.server.KafkaApis.handle(KafkaApis.scala:76)
>> >     at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>> >     at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.io.FileNotFoundException: /tmp/kafka-logs/__consumer_offsets-0/00000000000000000000.index (No such file or directory)
>> >     at java.io.RandomAccessFile.open0(Native Method)
>> >     at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
>> >     at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
>> >     at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:277)
>> >     at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:276)
>> >     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>> >     at kafka.log.OffsetIndex.resize(OffsetIndex.scala:276)
>> >     at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply$mcV$sp(OffsetIndex.scala:265)
>> >     at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
>> >     at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
>> >     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>> >     at kafka.log.OffsetIndex.trimToValidSize(OffsetIndex.scala:264)
>> >     at kafka.log.Log.roll(Log.scala:627)
>> >     at kafka.log.Log.maybeRoll(Log.scala:602)
>> >     at kafka.log.Log.append(Log.scala:357)
>> >     ... 23 more
>> >
>> >
>> > Server 2
>> >
>> > [2016-05-25 09:14:18,968] INFO [Group Metadata Manager on Broker 1001]: Loading offsets and group metadata from [__consumer_offsets,16] (kafka.coordinator.GroupMetadataManager)
>> > [2016-05-25 09:14:19,004] INFO New leader is 1001 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
>> > [2016-05-25 09:14:19,054] FATAL [Replica Manager on Broker 1001]: Halting due to unrecoverable I/O error while handling produce request: (kafka.server.ReplicaManager)
>> > kafka.common.KafkaStorageException: I/O exception in append to log '__consumer_offsets-0'
>> >     at kafka.log.Log.append(Log.scala:318)
>> >     at kafka.cluster.Partition$$anonfun$9.apply(Partition.scala:442)
>> >     at kafka.cluster.Partition$$anonfun$9.apply(Partition.scala:428)
>> >     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>> >     at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:268)
>> >     at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:428)
>> >     at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:401)
>> >     at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:386)
>> >     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>> >     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>> >     at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
>> >     at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>> >     at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>> >     at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:386)
>> >     at kafka.server.ReplicaManager.appendMessages(ReplicaManager.scala:322)
>> >     at kafka.coordinator.GroupMetadataManager.store(GroupMetadataManager.scala:228)
>> >     at kafka.coordinator.GroupCoordinator$$anonfun$handleCommitOffsets$9.apply(GroupCoordinator.scala:429)
>> >     at kafka.coordinator.GroupCoordinator$$anonfun$handleCommitOffsets$9.apply(GroupCoordinator.scala:429)
>> >     at scala.Option.foreach(Option.scala:257)
>> >     at kafka.coordinator.GroupCoordinator.handleCommitOffsets(GroupCoordinator.scala:429)
>> >     at kafka.server.KafkaApis.handleOffsetCommitRequest(KafkaApis.scala:280)
>> >     at kafka.server.KafkaApis.handle(KafkaApis.scala:76)
>> >     at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>> >     at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.io.FileNotFoundException: /tmp/kafka-logs/__consumer_offsets-0/00000000000000000000.index (No such file or directory)
>> >     at java.io.RandomAccessFile.open0(Native Method)
>> >     at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
>> >     at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
>> >     at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:277)
>> >     at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:276)
>> >     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>> >     at kafka.log.OffsetIndex.resize(OffsetIndex.scala:276)
>> >     at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply$mcV$sp(OffsetIndex.scala:265)
>> >     at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
>> >     at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
>> >     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>> >     at kafka.log.OffsetIndex.trimToValidSize(OffsetIndex.scala:264)
>> >     at kafka.log.Log.roll(Log.scala:627)
>> >     at kafka.log.Log.maybeRoll(Log.scala:602)
>> >     at kafka.log.Log.append(Log.scala:357)
>> >     ... 23 more
>> >
>> >
>> > This happened on 2 different servers, so I find it hard to believe they
>> > both had I/O problems at the same time. Does anyone have any idea about
>> > what might have happened?
>> >
>> > Thanks!