Last comment: I upgraded to Java 1.7 and restarted Kafka. It's now stable, but I haven't poked at it; I'm just letting it sit for now. Could the problem have been somehow related to running 1.6 with 0.8.2.1, even if that wasn't apparent in the logs?
On Tue, Jan 12, 2016 at 11:19 PM, Dillian Murphey <crackshotm...@gmail.com> wrote:
>
> [2016-01-12 22:16:59,629] TRACE [Controller 925537]: leader imbalance ratio for broker 925537 is 0.000000 (kafka.controller.KafkaController)
> [2016-01-12 22:21:07,167] INFO [SessionExpirationListener on 925537], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
> [2016-01-12 22:21:07,167] INFO [delete-topics-thread-925537], Shutting down (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
> [2016-01-12 22:21:07,169] INFO [delete-topics-thread-925537], Shutdown completed (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
> [2016-01-12 22:21:07,169] INFO [delete-topics-thread-925537], Stopped (kafka.controller.TopicDeletionManager$Del
>
> This occurs very frequently, even after clean-slating Kafka. This is something that never occurs in our production env. I've read here and there that it could be a GC issue. Here is the tail end of a recent GC log:
> 20534K(8354560K), 52.5293140 secs] [Times: user=209.09 sys=0.06, real=52.53 secs]
> 2016-01-11T23:16:05.149+0000: 784.219: [GC 784.219: [ParNew: 274263K->1685K(306688K), 54.8993730 secs] 793174K->520803K(8354560K), 54.8994450 secs] [Times: user=218.86 sys=0.03, real=54.90 secs]
> 2016-01-11T23:17:01.095+0000: 840.165: [GC 840.165: [ParNew: 274325K->1896K(306688K), 56.4208930 secs] 793443K->521139K(8354560K), 56.4209750 secs] [Times: user=224.88 sys=0.05, real=56.42 secs]
> 2016-01-11T23:17:59.024+0000: 898.093: [GC 898.093: [ParNew: 274536K->1705K(306688K), 58.1100630 secs] 793779K->521093K(8354560K), 58.1101400 secs] [Times: user=231.75 sys=0.05, real=58.12 secs]
> 2016-01-11T23:18:58.240+0000: 957.310: [GC 957.310: [ParNew: 274345K->1483K(306688K), 64.2820420 secs] 793733K->521047K(8354560K), 64.2821180 secs] [Times: user=241.93 sys=0.06, real=64.28 secs]
> 2016-01-11T23:20:03.571+0000: 1022.640: [GC 1022.640: [ParNew: 274123K->1379K(306688K), 61.5305280 secs] 793687K->521097K(8354560K), 61.5305990 secs] [Times: user=245.72 sys=0.01, real=61.53 secs]
> 2016-01-11T23:21:06.194+0000: 1085.263: [GC 1085.263: [ParNew: 274019K->1508K(306688K), 63.4433440 secs] 793737K->521372K(8354560K), 63.4434240 secs] [Times: user=253.33 sys=0.02, real=63.44 secs]
> 2016-01-11T23:22:10.413+0000: 1149.482: [GC 1149.483: [ParNew: 274148K->1313K(306688K), 65.6956010 secs] 794012K->521330K(8354560K), 65.6956660 secs] [Times: user=262.01 sys=0.05, real=65.69 secs]
> Heap
>  par new generation   total 306688K, used 132112K [0x00000005f5a00000, 0x000000060a6c0000, 0x000000060a6c0000)
>   eden space 272640K,  47% used [0x00000005f5a00000, 0x00000005fd9bbba0, 0x0000000606440000)
>   from space 34048K,   3% used [0x0000000606440000, 0x00000006065884a8, 0x0000000608580000)
>   to   space 34048K,   0% used [0x0000000608580000, 0x0000000608580000, 0x000000060a6c0000)
>  concurrent mark-sweep generation total 8047872K, used 520016K
> [0x000000060a6c0000, 0x00000007f5a00000, 0x00000007f5a00000)
>  concurrent-mark-sweep perm gen total 38760K, used 25768K [0x00000007f5a00000, 0x00000007f7fda000, 0x0000000800000000)
>
> On Tue, Jan 12, 2016 at 6:34 PM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
>> Can you paste the logs?
>>
>> Thanks,
>>
>> Mayuresh
>>
>> On Tue, Jan 12, 2016 at 4:58 PM, Dillian Murphey <crackshotm...@gmail.com> wrote:
>> > Possibly running more stable with the 1.7 JVM.
>> >
>> > Can someone explain the Zookeeper session? Should it never expire unless the broker becomes unresponsive? I set a massive timeout value in the broker config, far beyond the amount of time I see the ZK expiration. Is this entirely on the Kafka side, or could Zookeeper be doing something? From my ZK logs I didn't see anything unusual, just exceptions as a result of the ZK session expiring (my guess).
>> >
>> > tnx
>> >
>> > On Tue, Jan 12, 2016 at 3:05 PM, Dillian Murphey <crackshotm...@gmail.com> wrote:
>> > > Our 2-node Kafka cluster has become unhealthy. We're running Zookeeper as a 3-node ensemble, with very light load.
>> > >
>> > > What seems to be happening is: in the controller log we get a ZK session expire message, and in the process of re-assigning the leader for the partitions (if I'm understanding this right, please correct me), the broker goes offline and it interrupts our applications that are publishing messages.
>> > >
>> > > We don't see this in production, where Kafka has been stable for months, since September.
>> > >
>> > > I've searched a lot and found some similar complaints but no real solutions.
>> > >
>> > > I'm running 0.8.2 and JVM 1.6.x on Ubuntu.
>> > >
>> > > Thanks for any ideas at all.
>>
>> --
>> -Regards,
>> Mayuresh R. Gharat
>> (862) 250-7125
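On the "massive timeout value" question earlier in the thread: the broker-side setting is `zookeeper.session.timeout.ms` in `server.properties` (default 6000 ms in 0.8.x). Raising it only helps if the JVM can still heartbeat within the window, so a 50-60 s GC pause will expire the session regardless of any reasonable value. A sketch with an assumed (illustrative) value:

```properties
# server.properties -- ZK session timeout; must exceed the worst-case GC
# pause, but minutes-long values are a smell: fix the pause instead.
zookeeper.session.timeout.ms=30000
# Separate timeout for establishing the initial ZK connection.
zookeeper.connection.timeout.ms=6000
```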
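For what it's worth, the GC log above shows ParNew pauses of 50-65 s real time, which would dwarf any plausible ZK session timeout and make the broker look dead to Zookeeper. A quick way to eyeball whether pauses exceed the session timeout is to pull the `real=` times out of the GC log; here's a sketch (the log path and sample line are made up for illustration; point it at whatever file `-Xloggc:` writes):

```shell
# Write one sample GC line (copied from the log above) so the pipeline
# below is self-contained; in practice use your broker's real GC log.
cat > /tmp/gc-sample.log <<'EOF'
2016-01-11T23:16:05.149+0000: 784.219: [GC 784.219: [ParNew: 274263K->1685K(306688K), 54.8993730 secs] 793174K->520803K(8354560K), 54.8994450 secs] [Times: user=218.86 sys=0.03, real=54.90 secs]
EOF

# Extract the wall-clock pause of each collection, in seconds.
grep -o 'real=[0-9.]*' /tmp/gc-sample.log | cut -d= -f2
```

Any value printed that is larger than the ZK session timeout (in seconds) is a likely cause of an expiration. Kafka's start scripts let you adjust GC flags via the `KAFKA_JVM_PERFORMANCE_OPTS` environment variable rather than editing the scripts.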