[PR] [SPARK-53446][CORE] Optimize BlockManager remove operations with cached block mappings [spark]

via GitHub Fri, 17 Oct 2025 01:22:52 -0700


zml1206 opened a new pull request, #52646:
URL: https://github.com/apache/spark/pull/52646


   ### What changes were proposed in this pull request?
   Continue #52210.
   Introduces three concurrent hash maps that track block ID associations, so 
that BlockManager remove operations can use cached mappings instead of O(n) 
linear scans. 
   
   
   ### Why are the changes needed?
   Previously, removeRdd(), removeBroadcast(), and removeCache() required 
scanning all blocks in blockInfoManager.entries to find matches. This approach 
becomes a serious bottleneck when:
   
   1. Large block counts: In production deployments with millions or even tens 
of millions of cached blocks, linear scans can be prohibitively slow.
   2. High cleanup frequency: Workloads that repeatedly create and discard RDDs 
or broadcast variables accumulate overhead quickly.
   
   The original removeRdd() method already contained a TODO noting that an 
additional mapping would be needed to avoid linear scans. This PR implements 
that improvement.
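
To illustrate the general idea, here is a minimal sketch of a secondary index from RDD id to block ids, next to the kind of linear scan it is meant to replace. This is not the code from the PR: `RddBlock`, `BroadcastBlock`, `RddBlockIndex`, and `scanForRdd` are hypothetical names invented for this example, and the real BlockManager data structures may differ.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

// Hypothetical stand-ins for Spark's BlockId hierarchy (not the real classes).
sealed trait BlockId
final case class RddBlock(rddId: Int, split: Int) extends BlockId
final case class BroadcastBlock(broadcastId: Long) extends BlockId

// Linear scan: every stored block is inspected to find the ones for one RDD.
def scanForRdd(all: Iterable[BlockId], rddId: Int): Seq[BlockId] =
  all.collect { case b: RddBlock if b.rddId == rddId => b }.toSeq

// Hypothetical cached mapping: RDD id -> concurrent set of its block ids.
final class RddBlockIndex {
  private val byRdd = new ConcurrentHashMap[Int, java.util.Set[BlockId]]()

  // Called when a block is stored.
  def add(id: RddBlock): Unit = {
    byRdd.computeIfAbsent(id.rddId, _ => ConcurrentHashMap.newKeySet[BlockId]()).add(id)
    ()
  }

  // Called when a single block is dropped; removes the entry once it is empty.
  def remove(id: RddBlock): Unit = {
    byRdd.computeIfPresent(id.rddId, (_, set) => {
      set.remove(id)
      if (set.isEmpty) null else set
    })
    ()
  }

  // Called when a whole RDD is removed: one map lookup instead of a full scan.
  def removeAll(rddId: Int): Seq[BlockId] =
    Option(byRdd.remove(rddId)).map(_.asScala.toSeq).getOrElse(Seq.empty)
}
```

With an index like this, the cost of removing an RDD depends on the number of blocks that belong to it rather than on the total number of blocks held. The trade-off is that the index must be kept in sync on every block insert and removal.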
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Existing tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

