[ https://issues.apache.org/jira/browse/KAFKA-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus B updated KAFKA-5377:
----------------------------
    Description:
We are running Kafka in a 2 x broker cluster configuration on Windows, and overall it has been working well for us. We have been seeing occasional issues where the broker crashes first on one node, and then almost immediately on the second. When we try to restart, the broker continues to crash during startup until we fix the issue that caused the crash.

I will add that we took the source code for 0.10.2.1 and had to slightly update it, as Kafka on Windows has one bug: it does not properly clean up old files when rollover occurs (I had to modify the renameTo function in AbstractIndex, see attached). Without the change, Kafka runs fine (no crashes), but unfortunately it does not clean up log files because it still has the files locked when it tries to roll them over. The only change was in two files (attached), to address Kafka not unmapping files on Windows before moving them. After this change, topic files (index, timeindex, and log) are cleaned up properly, but we then see intermittent crashes due to a memory access violation (which I assume is related to the change to unmap files before rolling over).

In our test we are running 210 consumers/producers with a message rate of ~10 msg/second. It keeps up with the messages without issues, but the crashing of the broker is a problem.

- The Kafka data files are written to the E/F drives, and there is 200GB+ free space on either.
- The log files are stored on the D drive, with 200GB free space as well.
- The C drive just hosts the software; no log files or data files are written here (Java was by default writing memory dumps here, but we have updated it to write them to the D drive, and we now also clean them up, as they are large since we are running the broker with a large heap size).

From what I can tell, looking at the code, crash dump files, and log files, it is all happening because of the log cleaner, and I can pinpoint it in most (if not all) cases to TimeIndex. The Java dump file indicates some kind of access violation, but I am not sure when or how that is happening. It seems like the initial crashes happen during the compacting/swapping action, and then the startups fail when they try to access the bad files (TimeIndex.parse()). I finally figured out that the root cause of the startup crashes was a bad set of files in __consumer_offsets-x (which partition x varies). Once I deleted the bad files, the broker started up correctly again. It also appears that the log cleaner is where both the runtime and startup crashes occur, when it is compressing/swapping files for the consumer offsets.

I am attaching dump files from two separate instances of when it initially crashed, and then from when we tried to restart. Also including the broker config settings that we are using.

I'm not sure what additional information to provide, but I can add more if needed.

Any help, suggestions or input would be very much appreciated.

  was:
We are running Kafka in a 2 x broker cluster configuration on Windows, and overall it has been working well for us. We have been seeing occasional issues where the broker crashes first on one node, and then almost immediately on the second. When we try to restart, the broker continues to crash during startup until we fix the issue that caused the crash.

I finally figured out that the root cause of the startup crashes was a bad set of files in __consumer_offsets-2 (in this latest case; which partition is the cause varies). Once I deleted the bad files, the broker started up correctly again.

In our test we are running 210 consumers/producers with a message rate of ~10 msg/second. It keeps up with the messages without issues, but the crashing of the broker is a problem.

- The Kafka data files are written to the E/F drives, and there is 200GB+ free space on either.
- The log files are stored on the D drive, with 200GB free space as well.
- The C drive just hosts the software; no log files or data files are written here (Java was by default writing memory dumps here, but we have updated it to write them to the D drive, and we now also clean them up, as they are large since we are running the broker with a large heap size).

From what I can tell, looking at the code, crash dump files, and log files, it is all happening because of the log cleaner, and I can pinpoint it in most (if not all) cases to TimeIndex. The Java dump file indicates some kind of access violation, but I am not sure when or how that is happening. It seems like the initial crashes happen during the compacting/swapping action, and then the startups fail when they try to access the bad files (TimeIndex.parse()).

I am attaching dump files from two separate instances of when it initially crashed, and then from when we tried to restart. Also including the broker config settings that we are using.

I'm not sure what additional information to provide, but I can add more if needed.

Any help, suggestions or input would be very much appreciated.

> Kafka server process crashing due to access violation (caused by log cleaner)
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-5377
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5377
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.2.0, 0.10.2.1
>         Environment: Windows 2008 R2, Intel Xeon CPU, 64 GB RAM
>                      4 Disk Drives (C for software, D for log files, E/F for Kafka/Zookeeper data)
>                      2 broker cluster
>                      JAVA 8 (131)
>            Reporter: Markus B
>              Labels: windows
>         Attachments: AbstractIndex.scala, hs_err_pid15944.log,
>                      hs_err_pid6304.log, hs_err_pid7356.log, hs_err_pid9056.log,
>                      hs_err_pid9276.log, java_error7192.log, LogSegment.scala,
>                      server.1.properties
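
The report assumes the access violation is related to unmapping index files before they are renamed. One possible mechanism, offered here as an assumption and not as a diagnosis of the attached dumps, is a read that touches a memory-mapped buffer after another thread has released the mapping; on the JVM that kind of access can crash the process with an access violation instead of raising a Java exception. The sketch below shows the general guard pattern for that hazard. The class and method names (`GuardedIndexBuffer`, `timestampAt`, `release`) are hypothetical and are not Kafka's API, and a heap `ByteBuffer` stands in for the mapped buffer so the example can run without actually unmapping anything. The 12-byte entry size follows the time index layout of an 8-byte timestamp followed by a 4-byte relative offset.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; this is not Kafka's AbstractIndex or TimeIndex.
final class GuardedIndexBuffer {
    // Time index entry: 8-byte timestamp followed by a 4-byte relative offset.
    private static final int ENTRY_SIZE = 12;

    private final ReentrantLock lock = new ReentrantLock();
    private ByteBuffer buffer; // set to null once the mapping has been released

    GuardedIndexBuffer(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    // Reads the timestamp stored in the given entry slot, failing with an
    // ordinary exception if the buffer was already released.
    long timestampAt(int slot) {
        lock.lock();
        try {
            if (buffer == null) {
                throw new IllegalStateException("index buffer already released");
            }
            return buffer.getLong(slot * ENTRY_SIZE);
        } finally {
            lock.unlock();
        }
    }

    // Releases the buffer under the same lock that readers take, so no read
    // can be in progress while the reference is dropped. A real
    // implementation would unmap the MappedByteBuffer at this point
    // (assumption: that step is what allows the rename on Windows).
    void release() {
        lock.lock();
        try {
            buffer = null;
        } finally {
            lock.unlock();
        }
    }
}
```

With a guard like this, a reader that arrives after `release()` gets an `IllegalStateException` that can be logged and handled, where an unguarded read of an unmapped region could take the whole broker process down.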