[
https://issues.apache.org/jira/browse/KAFKA-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus B updated KAFKA-5377:
----------------------------
Attachment: LogSegment.scala, AbstractIndex.scala
Modified two files to address a bug on Windows where renameTo and
changeFileSuffixes do not work (the file needs to be unmapped before it can
be renamed).
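The unmap-before-rename idea can be sketched roughly as follows. This is not
the attached patch, only a minimal illustration in plain Java: the class name,
method names, and file names are made up for the example, and the unmap step
relies on JDK-internal classes reached through reflection
(sun.misc.Unsafe.invokeCleaner on Java 9+, the buffer's internal cleaner on
Java 8), which is an assumption about the runtime rather than a supported API.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class UnmapBeforeRename {

    // Best-effort release of a memory mapping. Uses JDK-internal classes via
    // reflection, so it may stop working on future JDK versions.
    static void unmap(MappedByteBuffer buffer) throws ReflectiveOperationException {
        try {
            // Java 9+: sun.misc.Unsafe.invokeCleaner(ByteBuffer)
            Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
            Field theUnsafe = unsafeClass.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            Object unsafe = theUnsafe.get(null);
            Method invokeCleaner = unsafeClass.getMethod("invokeCleaner", ByteBuffer.class);
            invokeCleaner.invoke(unsafe, buffer);
        } catch (NoSuchMethodException e) {
            // Java 8 fallback: call cleaner().clean() on the direct buffer
            Method cleanerMethod = buffer.getClass().getMethod("cleaner");
            cleanerMethod.setAccessible(true);
            Object cleaner = cleanerMethod.invoke(buffer);
            Method clean = cleaner.getClass().getMethod("clean");
            clean.setAccessible(true);
            clean.invoke(cleaner);
        }
    }

    // Writes data through a memory mapping, releases the mapping, then renames.
    // On Windows the rename is expected to fail while the mapping is still live.
    static Path writeMappedThenRename(Path src, Path dst, byte[] data) throws Exception {
        MappedByteBuffer mmap;
        try (FileChannel ch = FileChannel.open(src,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed
            mmap = ch.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
        }
        mmap.put(data);
        mmap.force();   // flush mapped changes to disk
        unmap(mmap);    // the buffer must not be touched after this point
        return Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("unmap-demo");
        // Illustrative file names only
        Path src = dir.resolve("00000000000000000000.timeindex");
        Path dst = dir.resolve("00000000000000000000.timeindex.swap");
        Path renamed = writeMappedThenRename(src, dst, new byte[] {1, 2, 3, 4});
        System.out.println("renamed to " + renamed.getFileName());
    }
}
```

The important detail is the ordering: flush, unmap, and only then rename.
Nothing may read or write the buffer once it has been unmapped, because
touching a released mapping can itself crash the JVM.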
> Kafka server process crashing due to access violation (caused by log cleaner)
> -----------------------------------------------------------------------------
>
> Key: KAFKA-5377
> URL: https://issues.apache.org/jira/browse/KAFKA-5377
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.10.2.0, 0.10.2.1
> Environment: Windows 2008 R2, Intel Xeon CPU, 64 GB RAM
> 4 Disk Drives (C for software, D for log files, E/F for Kafka/Zookeeper data)
> 2 broker cluster
> JAVA 8 (131)
> Reporter: Markus B
> Labels: windows
> Attachments: AbstractIndex.scala, hs_err_pid15944.log,
> hs_err_pid6304.log, hs_err_pid7356.log, hs_err_pid9056.log,
> hs_err_pid9276.log, java_error7192.log, LogSegment.scala, server.1.properties
>
>
> We are running Kafka in a two-broker cluster configuration on Windows, and
> overall it has been working well for us. We have been seeing occasional
> issues where the broker crashes first on one node, and then almost
> immediately on the second. When we try to restart, the broker continues to
> crash during startup until we fix the issue that caused the crash.
> I finally figured out that the root cause of the startup crashes was a bad
> set of files in __consumer_offsets-2 (in this latest case; which offsets
> partition is affected varies). Once I deleted the bad files, the broker
> started up correctly again.
> In our test we are running 210 consumers/producers with a message rate of
> ~10 msg/second. Kafka keeps up with the messages without issues, but the
> broker crashes are a problem.
> - The Kafka data files are written to the E/F drives, and there is 200GB+
> of free space on each.
> - The log files are stored on the D drive, which also has 200GB free.
> - The C drive just hosts the software; no log files or data files are
> written here (Java was writing memory dumps here by default, but we have
> changed it to write them to the D drive, and we now also clean them up,
> since they are large because we run the broker with a large heap size).
> From what I can tell from the code, crash dump files, and log files, it is
> all happening because of the log cleaner, and in most (if not all) cases I
> can narrow it down to TimeIndex. The Java dump file indicates some kind of
> access violation, but I am not sure when or how that is happening. It seems
> that the initial crashes happen during the compacting/swapping action, and
> that the startups then fail when they try to access the bad files
> (TimeIndex.parse()).
> I am attaching dump files from two separate incidents, covering both the
> initial crash and the subsequent restart attempts. I am also including the
> broker config settings that we are using.
> I'm not sure what additional information to provide, but I can add more if
> needed.
> Any help, suggestions, or input would be greatly appreciated.