Hello Kafka user group

I am running a Kafka cluster with 3 brokers and have been experiencing
frequent OutOfMemory errors, each time with a similar stack trace:


java.io.IOException: Map failed
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:938)
    at kafka.log.AbstractIndex$$anonfun$resize$1.apply$mcZ$sp(AbstractIndex.scala:188)
    at kafka.log.AbstractIndex$$anonfun$resize$1.apply(AbstractIndex.scala:173)
    at kafka.log.AbstractIndex$$anonfun$resize$1.apply(AbstractIndex.scala:173)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
    at kafka.log.AbstractIndex.resize(AbstractIndex.scala:173)
    at kafka.log.AbstractIndex$$anonfun$trimToValidSize$1.apply$mcZ$sp(AbstractIndex.scala:242)
    at kafka.log.AbstractIndex$$anonfun$trimToValidSize$1.apply(AbstractIndex.scala:242)
    at kafka.log.AbstractIndex$$anonfun$trimToValidSize$1.apply(AbstractIndex.scala:242)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
    at kafka.log.AbstractIndex.trimToValidSize(AbstractIndex.scala:241)
    at kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:501)
    at kafka.log.Log$$anonfun$roll$2$$anonfun$apply$32.apply(Log.scala:1635)
    at kafka.log.Log$$anonfun$roll$2$$anonfun$apply$32.apply(Log.scala:1635)
    at scala.Option.foreach(Option.scala:257)
    at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1635)
    at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1599)
    at kafka.log.Log.maybeHandleIOException(Log.scala:1996)
    at kafka.log.Log.roll(Log.scala:1599)
    at kafka.log.Log$$anonfun$deleteSegments$1.apply$mcI$sp(Log.scala:1434)
    at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1429)
    at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1429)
    at kafka.log.Log.maybeHandleIOException(Log.scala:1996)
    at kafka.log.Log.deleteSegments(Log.scala:1429)
    at kafka.log.Log.deleteOldSegments(Log.scala:1424)
    at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1501)
    at kafka.log.Log.deleteOldSegments(Log.scala:1492)
    at kafka.log.LogCleaner$CleanerThread$$anonfun$cleanFilthiestLog$1.apply(LogCleaner.scala:328)
    at kafka.log.LogCleaner$CleanerThread$$anonfun$cleanFilthiestLog$1.apply(LogCleaner.scala:324)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:324)
    at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:300)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: java.lang.OutOfMemoryError: Map failed
    at sun.nio.ch.FileChannelImpl.map0(Native Method)
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:935)
    ... 32 more


Each broker has 16 GB of memory, of which 10 GB is allotted to the JVM as
heap. The total partition count on each broker is approximately 2000, with
an average partition size of 300 MB.


After looking around, I found that increasing the OS-level memory map area
limit `vm.max_map_count` is a viable solution, since Kafka memory-maps
segment index files while rolling over and the above stack trace indicates
a failure in doing exactly that. Since then I have raised this limit every
time a broker goes down with this error. Currently I am at 250,000 on two
brokers and 200,000 on one, which is very high considering the estimation
formula mentioned at https://kafka.apache.org/documentation/#os. Most
recently I started monitoring the memory map count of the Kafka process on
each broker (by counting entries in /proc/<pid>/maps); the graph is below.


[image: Screenshot 2019-09-27 at 12.02.38 PM.png]
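
For reference, the kind of check behind the graph above looks roughly like
the sketch below (a minimal version; the broker PID is looked up
separately, and the "expected" figure assumes the default 1 GB segment
size and two memory-mapped index files per segment, which are my
assumptions rather than numbers taken from the documentation):

import math

PARTITIONS = 2000          # approximate partition count per broker
PARTITION_SIZE_MB = 300    # average partition size
SEGMENT_SIZE_MB = 1024     # assumption: default log.segment.bytes (1 GB)
MAPS_PER_SEGMENT = 2       # assumption: offset index + time index per segment

# Rough expectation from the sizing numbers above.
segments_per_partition = max(1, math.ceil(PARTITION_SIZE_MB / SEGMENT_SIZE_MB))
expected_maps = PARTITIONS * segments_per_partition * MAPS_PER_SEGMENT

def current_map_areas(pid):
    """Count the mmap regions of a process: one line per region in /proc/<pid>/maps."""
    with open("/proc/%d/maps" % pid) as f:
        return sum(1 for _ in f)

def max_map_count():
    """Read the OS limit currently configured via vm.max_map_count."""
    with open("/proc/sys/vm/max_map_count") as f:
        return int(f.read())

print("expected map areas (rough):", expected_maps)    # ~4,000 with my numbers
print("configured vm.max_map_count:", max_map_count())
# print("current map areas:", current_map_areas(broker_pid))  # broker_pid found separately

Even with generous rounding the estimate stays in the low thousands, which
is why 200,000 to 250,000 feels far too high to me.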


My concern is that this value is on an overall increasing trend, with an
average increase of 27.7K across brokers over the roughly 2 days of
monitoring.
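
To put that growth in perspective, a back-of-the-envelope projection (the
current count used here is a placeholder, not an actual reading from the
graph):

LIMIT = 250000              # vm.max_map_count currently set on two of the brokers
CURRENT = 150000            # placeholder only; the real value comes from the graph
GROWTH_PER_DAY = 27700 / 2.0  # average increase of 27.7K observed over ~2 days

print("days until the limit is reached at this rate: %.1f"
      % ((LIMIT - CURRENT) / GROWTH_PER_DAY))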


Following are my questions:

   1. Will I have to keep incrementing `vm.max_map_count` until I arrive at
   a stable value?
   2. Could this by any chance indicate a memory leak, perhaps in the
   subroutine that rolls over segment files?
   3. Could a lack of page cache memory be a cause as well? The volume of
   cached memory (see the sketch after this list for how it can be sampled)
   seems to remain consistent over time, so it doesn't appear to be a
   suspect, but I am not ruling it out for now. As a mitigation I will
   decrease the JVM heap next time so that more memory is available for the
   page cache.
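
For completeness, a minimal sketch of the kind of check I mean for the
cached-memory volume in question 3, reading /proc/meminfo directly (the
field selection is just what I find useful, not anything Kafka-specific):

def meminfo_mb(fields=("MemTotal", "MemFree", "Cached", "MemAvailable")):
    """Return selected /proc/meminfo fields converted from kB to MB."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, rest = line.split(":", 1)
            if name in fields:
                values[name] = int(rest.strip().split()[0]) // 1024
    return values

print(meminfo_mb())   # e.g. {'MemTotal': 16012, 'Cached': ..., ...}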

-- 

Arpit Gogia | Data Engineer
