Bill Bejeck created KAFKA-20616:
-----------------------------------
Summary: Close-path leaks in RocksDBStore cause native memory
growth that eventually leads to OOM
Key: KAFKA-20616
URL: https://issues.apache.org/jira/browse/KAFKA-20616
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 4.3.0, 4.4.0
Reporter: Bill Bejeck
Assignee: Bill Bejeck
Fix For: 4.3.1, 4.4.0
Primary leak (KAFKA-20456 follow-up). RocksDBStore.createOffsetsCFOptions()
returns a new ColumnFamilyOptions() that is passed to a ColumnFamilyDescriptor
and then dropped: it is never assigned to a field and never closed. On the JNI
side, constructing a ColumnFamilyOptions auto-allocates a default
BlockBasedTableFactory with an 8 MB LRUCache. Native heap profiles from the
soak test confirm this directly:
Java_org_rocksdb_ColumnFamilyOptions_newColumnFamilyOptions →
BlockBasedTableFactory::InitializeOptions → LRUCacheOptions::MakeSharedCache
accounts for 5.5 GB (70%) on soak1 and 2.6 GB (54%) on soak2. The leak
compounds per segment and per task, so windowed/segmented stores amplify it
heavily.
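The ownership problem can be illustrated with a minimal, self-contained sketch. It does not use the real RocksDB classes: FakeCfOptions, LeakyStoreSketch, OwningStoreSketch and OffsetsOptionsLeakDemo are hypothetical stand-ins, and the counter only models "native allocations that were never released". It is not the actual patch, only the general shape of keeping the options object in a field so close() can release it.

```java
// Stand-in for org.rocksdb.ColumnFamilyOptions (hypothetical, not the real class).
// It only counts instances that were created but never closed.
final class FakeCfOptions implements AutoCloseable {
    static int live = 0;
    private boolean closed = false;

    FakeCfOptions() {
        live++; // models the native allocation made in the constructor
    }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            live--; // models releasing the native allocation
        }
    }
}

// Leaky shape: the options object is created, used once, and the reference is lost.
final class LeakyStoreSketch implements AutoCloseable {
    void open() {
        FakeCfOptions offsetsOptions = new FakeCfOptions();
        // ... would be handed to a descriptor here, then goes out of scope unclosed
    }

    @Override
    public void close() {
        // nothing to release: the store never kept a reference
    }
}

// Owning shape: the store keeps the options object in a field and closes it in close().
final class OwningStoreSketch implements AutoCloseable {
    private FakeCfOptions offsetsOptions;

    void open() {
        offsetsOptions = new FakeCfOptions();
    }

    @Override
    public void close() {
        if (offsetsOptions != null) {
            offsetsOptions.close();
            offsetsOptions = null;
        }
    }
}

final class OffsetsOptionsLeakDemo {
    // Opens and closes one store per segment; returns how many options objects stay unclosed.
    static int unclosedAfter(boolean owning, int segments) {
        int before = FakeCfOptions.live;
        for (int i = 0; i < segments; i++) {
            if (owning) {
                OwningStoreSketch store = new OwningStoreSketch();
                store.open();
                store.close();
            } else {
                LeakyStoreSketch store = new LeakyStoreSketch();
                store.open();
                store.close();
            }
        }
        return FakeCfOptions.live - before;
    }

    public static void main(String[] args) {
        System.out.println("leaky:  " + unclosedAfter(false, 3));
        System.out.println("owning: " + unclosedAfter(true, 3));
    }
}
```

In this model the leaky shape leaves one unclosed object per opened store, which mirrors the per-segment, per-task compounding described above; the owning shape leaves none.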
Secondary leak (KIP-1035 close path). AbstractColumnFamilyAccessor.close()
writes a closedState marker to the offsets CF. If that write throws (which
happens during the EOSv2 cascade or an unclean shutdown, a case the existing
code comment already acknowledges), the subsequent
offsetColumnFamilyHandle.close() is skipped. SingleColumnFamilyAccessor.close()
and DualColumnFamilyAccessor.close() have the same non-finally ordering, so the
data CF handles (and the oldCF/newCF handles for migrating stores) also leak
whenever super.close() propagates an exception. RocksDBStore.close() swallows
the resulting RocksDBException, so the leak is silent.
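The close-ordering problem can be sketched the same way. FakeCfHandle and AccessorCloseSketch below are hypothetical stand-ins rather than the real accessor classes, and an unchecked IllegalStateException stands in for RocksDBException. The sketch contrasts the non-finally ordering with one possible try/finally arrangement; the actual fix may be structured differently.

```java
// Stand-in for a ColumnFamilyHandle (hypothetical, not the real RocksDB class).
final class FakeCfHandle implements AutoCloseable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true;
    }
}

// Models an accessor whose close() first writes a marker and then closes two handles.
final class AccessorCloseSketch {
    final FakeCfHandle offsetsHandle = new FakeCfHandle();
    final FakeCfHandle dataHandle = new FakeCfHandle();
    private final boolean markerWriteFails;

    AccessorCloseSketch(boolean markerWriteFails) {
        this.markerWriteFails = markerWriteFails;
    }

    private void writeClosedMarker() {
        if (markerWriteFails) {
            // IllegalStateException stands in for RocksDBException in this sketch
            throw new IllegalStateException("marker write failed");
        }
    }

    // Non-finally ordering: a failed marker write skips both handle closes.
    void closeSequential() {
        writeClosedMarker();
        offsetsHandle.close();
        dataHandle.close();
    }

    // finally ordering: both handles are closed even if the marker write throws.
    void closeWithFinally() {
        try {
            try {
                writeClosedMarker();
            } finally {
                offsetsHandle.close();
            }
        } finally {
            dataHandle.close();
        }
    }

    // Mirrors a caller that swallows the exception, which is what makes the leak silent.
    static AccessorCloseSketch closeQuietly(boolean useFinally) {
        AccessorCloseSketch accessor = new AccessorCloseSketch(true);
        try {
            if (useFinally) {
                accessor.closeWithFinally();
            } else {
                accessor.closeSequential();
            }
        } catch (IllegalStateException swallowed) {
            // intentionally ignored, as in the silent-leak scenario
        }
        return accessor;
    }

    public static void main(String[] args) {
        AccessorCloseSketch leaky = closeQuietly(false);
        AccessorCloseSketch safe = closeQuietly(true);
        System.out.println("sequential: offsets closed=" + leaky.offsetsHandle.closed
                + ", data closed=" + leaky.dataHandle.closed);
        System.out.println("finally:    offsets closed=" + safe.offsetsHandle.closed
                + ", data closed=" + safe.dataHandle.closed);
    }
}
```

The nested try/finally keeps the original exception propagating to the caller while still releasing every handle, so a caller that swallows the exception no longer hides a leak.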
--
This message was sent by Atlassian Jira
(v8.20.10#820010)