How did you arrive at the 10 GB JVM heap value? I'm running Kafka on 16 GB RAM instances with ~4000 partitions each and assigning only 5 GB to the JVM, of which Kafka only seems to be using ~2 GB at any given time.
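In case it's useful, one quick way to sanity-check actual heap usage before settling on an -Xmx value is something like the sketch below -- it assumes jstat from the JDK is on the PATH and the usual `jstat -gc` column names (S0U/S1U/EU/OU, reported in KB):

    import subprocess
    import sys

    def used_heap_gb(pid: int) -> float:
        # jstat -gc prints a header row and a value row; survivor, eden and
        # old-gen usage (S0U, S1U, EU, OU) are reported in KB.
        out = subprocess.check_output(["jstat", "-gc", str(pid)], text=True)
        header, values = out.splitlines()[:2]
        cols = dict(zip(header.split(), values.split()))
        used_kb = sum(float(cols[c]) for c in ("S0U", "S1U", "EU", "OU")
                      if cols.get(c, "-") != "-")
        return used_kb / (1024 * 1024)

    if __name__ == "__main__":
        print(f"approx. used heap: {used_heap_gb(int(sys.argv[1])):.2f} GB")

Watching that over a day or two is one way to decide how much of the 16 GB can safely be left to the page cache instead of the heap.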
Also, I've set vm.max_map_count to 262144 -- didn't use any formula to estimate that, must have been some answer I found online, but it's been doing the trick -- no issues so far. (A quick way to compare the live map count against that limit is sketched below, after the quoted mail.)

On Fri, Sep 27, 2019 at 11:29 AM Arpit Gogia <ar...@ixigo.com> wrote:

> Hello Kafka user group
>
> I am running a Kafka cluster with 3 brokers and have been experiencing
> frequent OutOfMemory errors, each time with a similar stack trace:
>
> java.io.IOException: Map failed
>     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:938)
>     at kafka.log.AbstractIndex$$anonfun$resize$1.apply$mcZ$sp(AbstractIndex.scala:188)
>     at kafka.log.AbstractIndex$$anonfun$resize$1.apply(AbstractIndex.scala:173)
>     at kafka.log.AbstractIndex$$anonfun$resize$1.apply(AbstractIndex.scala:173)
>     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
>     at kafka.log.AbstractIndex.resize(AbstractIndex.scala:173)
>     at kafka.log.AbstractIndex$$anonfun$trimToValidSize$1.apply$mcZ$sp(AbstractIndex.scala:242)
>     at kafka.log.AbstractIndex$$anonfun$trimToValidSize$1.apply(AbstractIndex.scala:242)
>     at kafka.log.AbstractIndex$$anonfun$trimToValidSize$1.apply(AbstractIndex.scala:242)
>     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
>     at kafka.log.AbstractIndex.trimToValidSize(AbstractIndex.scala:241)
>     at kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:501)
>     at kafka.log.Log$$anonfun$roll$2$$anonfun$apply$32.apply(Log.scala:1635)
>     at kafka.log.Log$$anonfun$roll$2$$anonfun$apply$32.apply(Log.scala:1635)
>     at scala.Option.foreach(Option.scala:257)
>     at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1635)
>     at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1599)
>     at kafka.log.Log.maybeHandleIOException(Log.scala:1996)
>     at kafka.log.Log.roll(Log.scala:1599)
>     at kafka.log.Log$$anonfun$deleteSegments$1.apply$mcI$sp(Log.scala:1434)
>     at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1429)
>     at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1429)
>     at kafka.log.Log.maybeHandleIOException(Log.scala:1996)
>     at kafka.log.Log.deleteSegments(Log.scala:1429)
>     at kafka.log.Log.deleteOldSegments(Log.scala:1424)
>     at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1501)
>     at kafka.log.Log.deleteOldSegments(Log.scala:1492)
>     at kafka.log.LogCleaner$CleanerThread$$anonfun$cleanFilthiestLog$1.apply(LogCleaner.scala:328)
>     at kafka.log.LogCleaner$CleanerThread$$anonfun$cleanFilthiestLog$1.apply(LogCleaner.scala:324)
>     at scala.collection.immutable.List.foreach(List.scala:392)
>     at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:324)
>     at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:300)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> Caused by: java.lang.OutOfMemoryError: Map failed
>     at sun.nio.ch.FileChannelImpl.map0(Native Method)
>     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:935)
>     ... 32 more
>
> Each broker has 16 GB of memory, of which 10 GB is allotted to the JVM as
> heap. The total partition count on each broker is approximately 2000, with
> an average partition size of 300 MB.
>
> After looking around, I found that increasing the OS-level memory map area
> limit `vm.max_map_count` is a viable solution, since Kafka memory-maps
> segment files while rolling over and the above stack trace indicates a
> failure in doing that. Since then I have increased this limit every time a
> broker goes down with this error.
> Currently I am at 250,000 on two brokers and 200,000 on one, which is very
> high considering the estimation formula mentioned at
> https://kafka.apache.org/documentation/#os. Most recently I started to
> monitor the memory map file count (using /proc/<pid>/maps) of the Kafka
> process on each broker; below is the graph.
>
> [image: Screenshot 2019-09-27 at 12.02.38 PM.png]
>
> My concern is that this value is on an overall increasing trend, with an
> average increase of 27.7K across brokers over roughly 2 days of monitoring.
>
> Following are my questions:
>
> 1. Will I have to keep incrementing `vm.max_map_count` until I arrive at a
> stable value?
> 2. Could this by any chance indicate a memory leak? Maybe in the subroutine
> that rolls over segment files?
> 3. Could a lack of page cache memory be a cause as well? The volume of
> cached memory seems to remain consistent over time, so it doesn't appear to
> be a suspect, but I am not ruling it out for now. As a mitigation I will be
> decreasing the JVM heap next time so that more memory is available for the
> page cache.
>
> --
>
> Arpit Gogia | Data Engineer
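Re: the /proc/<pid>/maps monitoring in the quoted mail -- comparing the live map count against the sysctl limit and a rough estimate can be scripted along these lines (a sketch only; it assumes Linux, the default 1 GB segment.bytes, and, if I remember the docs' estimate right, roughly two map areas per log segment for the offset and time index files -- plug in your own numbers):

    import math
    import sys

    def current_map_count(pid: int) -> int:
        # One line in /proc/<pid>/maps per active memory mapping.
        with open(f"/proc/{pid}/maps") as f:
            return sum(1 for _ in f)

    def max_map_count() -> int:
        # Current value of the vm.max_map_count sysctl.
        with open("/proc/sys/vm/max_map_count") as f:
            return int(f.read())

    def estimated_map_count(partitions: int, avg_partition_mb: float,
                            segment_mb: float = 1024) -> int:
        # Lower-bound estimate: two mmapped index files per log segment.
        segments = partitions * max(1, math.ceil(avg_partition_mb / segment_mb))
        return 2 * segments

    if __name__ == "__main__":
        pid = int(sys.argv[1])
        print("live maps:    ", current_map_count(pid))
        print("sysctl limit: ", max_map_count())
        print("rough estimate:", estimated_map_count(2000, 300))

With ~2000 partitions averaging 300 MB and 1 GB segments, that estimate lands around 4,000, which is nowhere near 250,000 -- consistent with your point that the limits you've already raised look very high for the cluster size.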