Bill Bejeck created KAFKA-20616:
-----------------------------------

             Summary: Close-path leaks in RocksDBStore cause native memory 
growth that eventually leads to OOM
                 Key: KAFKA-20616
                 URL: https://issues.apache.org/jira/browse/KAFKA-20616
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 4.3.0, 4.4.0
            Reporter: Bill Bejeck
            Assignee: Bill Bejeck
             Fix For: 4.3.1, 4.4.0


Primary leak (KAFKA-20456 follow-up). RocksDBStore.createOffsetsCFOptions()
returns a new ColumnFamilyOptions() that is passed to a ColumnFamilyDescriptor
and then dropped: it is never assigned to a field and never closed. On the JNI
side, constructing a ColumnFamilyOptions auto-allocates a default
BlockBasedTableFactory with an 8 MB LRUCache. Native heap profiles from the
soak confirm this directly: the call path
Java_org_rocksdb_ColumnFamilyOptions_newColumnFamilyOptions →
BlockBasedTableFactory::InitializeOptions → LRUCacheOptions::MakeSharedCache
accounts for 5.5 GB (70%) on soak1 and 2.6 GB (54%) on soak2. The leak
compounds per segment, per task, so windowed/segmented stores amplify it
heavily.
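
The ownership problem described above can be sketched in isolation. The example below is a minimal illustration, not the actual patch: FakeCfOptions and FakeCfDescriptor are hypothetical stand-ins for org.rocksdb.ColumnFamilyOptions and org.rocksdb.ColumnFamilyDescriptor, and LeakyStore/FixedStore are invented names. A live-instance counter plays the role of native memory that only close() releases. The assumed fix shape is simply "keep the options in a field and close them when the store closes".

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for org.rocksdb.ColumnFamilyOptions: every live
// instance represents native memory that is released only by close().
final class FakeCfOptions implements AutoCloseable {
    static final AtomicInteger LIVE = new AtomicInteger();
    private boolean closed;

    FakeCfOptions() { LIVE.incrementAndGet(); }

    @Override
    public void close() {
        if (!closed) { closed = true; LIVE.decrementAndGet(); }
    }
}

// Hypothetical stand-in for a column family descriptor: it references the
// options but does not own or close them.
final class FakeCfDescriptor {
    final String name;
    final FakeCfOptions options;

    FakeCfDescriptor(String name, FakeCfOptions options) {
        this.name = name;
        this.options = options;
    }
}

// Leaky shape: the options are created inline and no reference is kept,
// so close() has nothing to release.
final class LeakyStore implements AutoCloseable {
    void open() {
        FakeCfDescriptor offsets = new FakeCfDescriptor("offsets", new FakeCfOptions());
    }

    @Override
    public void close() { /* options are unreachable here */ }
}

// Assumed fix shape: the store owns the options and closes them on close().
final class FixedStore implements AutoCloseable {
    private FakeCfOptions offsetsCfOptions;

    void open() {
        offsetsCfOptions = new FakeCfOptions();
        FakeCfDescriptor offsets = new FakeCfDescriptor("offsets", offsetsCfOptions);
    }

    @Override
    public void close() {
        if (offsetsCfOptions != null) {
            offsetsCfOptions.close();
            offsetsCfOptions = null;
        }
    }
}

final class OptionsLeakDemo {
    // Opens and closes `cycles` stores; returns how many options objects are still live.
    static int leakedAfter(int cycles, boolean fixed) {
        FakeCfOptions.LIVE.set(0);
        for (int i = 0; i < cycles; i++) {
            if (fixed) {
                FixedStore s = new FixedStore();
                s.open();
                s.close();
            } else {
                LeakyStore s = new LeakyStore();
                s.open();
                s.close();
            }
        }
        return FakeCfOptions.LIVE.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakedAfter(3, false)); // prints "leaky: 3"
        System.out.println("fixed: " + leakedAfter(3, true));  // prints "fixed: 0"
    }
}
```

In the sketch, each open/close cycle of the leaky variant leaves one options object behind, which mirrors how the real leak would scale with the number of segments and tasks opened over time.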
                                                 
Secondary leak (KIP-1035 close path). AbstractColumnFamilyAccessor.close()
writes a closedState marker to the offsets CF; if that write throws (which
happens during the EOSv2 cascade or unclean shutdown, a case the existing code
comment already acknowledges), the subsequent offsetColumnFamilyHandle.close()
is skipped. SingleColumnFamilyAccessor.close() and
DualColumnFamilyAccessor.close() have the same non-finally ordering, so the
data CF handles (and oldCF/newCF for migrating stores) also leak whenever
super.close() propagates the exception. RocksDBStore.close() swallows the
resulting RocksDBException, so the leak is silent.
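
The close-ordering issue can be illustrated with a small sketch, again under stated assumptions: FakeHandle, BaseAccessor, and DataAccessor are hypothetical stand-ins for the column family handles and accessor classes named above, and an unchecked IllegalStateException stands in for the checked RocksDBException. It shows one possible shape of a fix, where each handle is closed in a finally block so that a failing marker write cannot skip it; it is not necessarily how the actual patch is written.

```java
// Hypothetical stand-in for a column family handle that must always be closed.
final class FakeHandle implements AutoCloseable {
    boolean closed;

    @Override
    public void close() { closed = true; }
}

// Stand-in for the base accessor: writes a closed-state marker, then closes
// the offsets handle. The finally block guarantees the handle is released
// even when the marker write throws.
class BaseAccessor implements AutoCloseable {
    final FakeHandle offsetsHandle = new FakeHandle();
    private final boolean markerWriteFails;

    BaseAccessor(boolean markerWriteFails) {
        this.markerWriteFails = markerWriteFails;
    }

    void writeClosedMarker() {
        // Unchecked exception used here in place of RocksDBException.
        if (markerWriteFails) {
            throw new IllegalStateException("marker write failed");
        }
    }

    @Override
    public void close() {
        try {
            writeClosedMarker();
        } finally {
            offsetsHandle.close();
        }
    }
}

// Stand-in for a subclass accessor that also owns a data CF handle.
final class DataAccessor extends BaseAccessor {
    final FakeHandle dataHandle = new FakeHandle();

    DataAccessor(boolean markerWriteFails) { super(markerWriteFails); }

    @Override
    public void close() {
        try {
            super.close();
        } finally {
            // Runs even if super.close() propagates the marker-write failure.
            dataHandle.close();
        }
    }
}

final class CloseOrderDemo {
    // Returns true when close() threw and both handles were still closed.
    static boolean handlesClosedDespiteFailure() {
        DataAccessor accessor = new DataAccessor(true);
        boolean threw = false;
        try {
            accessor.close();
        } catch (IllegalStateException e) {
            threw = true;
        }
        return threw && accessor.offsetsHandle.closed && accessor.dataHandle.closed;
    }

    public static void main(String[] args) {
        System.out.println(handlesClosedDespiteFailure()); // prints "true"
    }
}
```

With this ordering the exception still reaches the caller, so a caller that swallows it loses only the error report rather than the native handles.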



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
