[ https://issues.apache.org/jira/browse/KAFKA-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus B updated KAFKA-5377:
----------------------------
    Environment:
Windows 2008 R2, Intel Xeon CPU, 64 GB RAM
4 Disk Drives (C for software, D for log files, E/F for Kafka/Zookeeper data)
2 broker cluster
JAVA 8 (131)

  was:
Windows 2008 R2, Intel Xeon CPU, 64 GB RAM
4 Disk Drives (C for software, D for log files, E/F for Kafka/Zookeeper data)
2 broker cluster

> Kafka server process crashing due to access violation (caused by log cleaner)
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-5377
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5377
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.2.0, 0.10.2.1
>         Environment: Windows 2008 R2, Intel Xeon CPU, 64 GB RAM
> 4 Disk Drives (C for software, D for log files, E/F for Kafka/Zookeeper data)
> 2 broker cluster
> JAVA 8 (131)
>            Reporter: Markus B
>              Labels: windows
>         Attachments: hs_err_pid15944.log, hs_err_pid6304.log,
> hs_err_pid7356.log, hs_err_pid9056.log, hs_err_pid9276.log,
> java_error7192.log, server.1.properties
>
>
> We are running Kafka in a 2-broker cluster configuration on Windows, and
> overall it has been working well for us. We have been seeing occasional
> issues where the broker crashes first on one node, and then almost
> immediately on the second. When we try to restart, the broker continues to
> crash during startup until we fix the issue that caused the crash.
> I finally figured out that the root cause of the startup crashes was a bad
> set of files in __consumer_offsets-2 (in this latest case; which
> __consumer_offsets partition is affected varies). Once I deleted the bad
> files, the broker started up correctly again.
> From what I can tell from the code, the crash dump files, and the log files,
> it is all happening because of the log cleaner, and in most (if not all)
> cases I can pinpoint it to TimeIndex.
> The Java dump file indicates some kind of access violation, but I am not
> sure when or how it happens. It seems that the initial crashes happen during
> the compacting/swapping action, and the startups then fail when they try to
> access the bad files (TimeIndex.parse()).
> I am attaching dump files from two separate instances: from when the broker
> initially crashed, and from when we tried to restart it. I am also including
> the broker config settings that we are using.
> I'm not sure what additional information to provide, but I can add more if
> needed.
> Any help, suggestions or input would be much appreciated.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)