[ 
https://issues.apache.org/jira/browse/CASSANDRA-19785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874184#comment-17874184
 ] 

Brandon Williams commented on CASSANDRA-19785:
----------------------------------------------

Thanks for the patch, Benedict! I'll do what I can to move this along.  Here's 
CI:

||Branch||CI||
|[4.0|https://github.com/driftx/cassandra/tree/CASSANDRA-19785-4.0]|[j8|https://app.circleci.com/pipelines/github/driftx/cassandra/1707/workflows/8f339da4-78ee-4024-81c9-7b26c940a93d], [j11|https://app.circleci.com/pipelines/github/driftx/cassandra/1707/workflows/22f1cd83-e563-48c7-9877-febb5c617067]|
|[4.1|https://github.com/driftx/cassandra/tree/CASSANDRA-19785-4.1]|[j8|https://app.circleci.com/pipelines/github/driftx/cassandra/1706/workflows/fb5cb598-f96f-4785-a22b-0a808bde7814], [j11|https://app.circleci.com/pipelines/github/driftx/cassandra/1706/workflows/7c086e14-dfac-47c5-97ce-af9d3216149a]|
|[5.0|https://github.com/driftx/cassandra/tree/CASSANDRA-19785-5.0]|[j11|https://app.circleci.com/pipelines/github/driftx/cassandra/1705/workflows/e0ecc857-0157-447a-8851-27f65e6a2f9c], [j17|https://app.circleci.com/pipelines/github/driftx/cassandra/1705/workflows/f3be8609-490a-459e-b080-2a3b11481a2d]|
|[trunk|https://github.com/driftx/cassandra/tree/CASSANDRA-19785-trunk]|[j11|https://app.circleci.com/pipelines/github/driftx/cassandra/1708/workflows/c3067071-82a3-4122-9af3-bb448ed481fb], [j17|https://app.circleci.com/pipelines/github/driftx/cassandra/1708/workflows/f825016f-0342-45b7-ad17-db745e51c341]|


> Possible memory leak in BTree.FastBuilder 
> ------------------------------------------
>
>                 Key: CASSANDRA-19785
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19785
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Core
>            Reporter: Paul Chandler
>            Priority: Normal
>             Fix For: 4.0.x
>
>         Attachments: image-2024-07-19-08-44-56-714.png, 
> image-2024-07-19-08-45-17-289.png, image-2024-07-19-08-45-33-933.png, 
> image-2024-07-19-08-45-50-383.png, image-2024-07-19-08-46-06-919.png, 
> image-2024-07-19-08-46-42-979.png, image-2024-07-19-08-46-56-594.png, 
> image-2024-07-19-08-47-19-517.png, image-2024-07-19-08-47-34-582.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We are having a problem with the heap growing in size. This is a large 
> cluster (more than 1,000 nodes) across a large number of DCs, running 
> version 4.0.11.
>  
> Each node has a 32GB heap, and the amount used continues to grow until it 
> reaches 30GB. It then struggles with multiple Full GC pauses, as can be seen 
> here:
> !image-2024-07-19-08-44-56-714.png!
> We took 2 heap dumps on one node a few days after it was restarted, and the 
> heap had grown by 2.7GB between them.
>  
> 9th July
> !image-2024-07-19-08-45-17-289.png!
> 11th July
> !image-2024-07-19-08-45-33-933.png!
> This is mainly an increase in memory used by FastThreadLocalThread, which 
> grew from 5.92GB to 8.53GB.
> !image-2024-07-19-08-45-50-383.png!
> !image-2024-07-19-08-46-06-919.png!
> Looking deeper, the growing heap is contained within the threads for the 
> MutationStage, Native-Transport-Requests, ReadStage, etc. We would expect 
> the memory used within these threads to be short-lived and not to grow over 
> time. We recently increased the size of these thread pools, and that has 
> increased the size of the problem.
>  
> Top memory usage for FastThreadLocalThread
> 9th July
> !image-2024-07-19-08-46-42-979.png!
> 11th July
> !image-2024-07-19-08-46-56-594.png!
> This has led us to investigate whether there could be a memory leak, and we 
> have found the following issues with references retained in 
> BTree.FastBuilder objects. The issue appears to stem from the reset() method, 
> which does not properly clear all buffers. We are not really sure how 
> BTree.FastBuilder works, but this is our analysis of where a leak might 
> occur.
>  
> Specifically:
> Leaf Buffer Not Being Cleared:
> When leaf().count is 0, the statement Arrays.fill(leaf().buffer, 0, 
> leaf().count, null); does not clear the buffer because the end index is 0. 
> This leaves the buffer with references to potentially large objects, 
> preventing garbage collection and increasing heap usage.
> Branch inUse Property:
> If the inUse property of the branch is set to false elsewhere in the code, 
> the while loop while (branch != null && branch.inUse) does not execute, 
> resulting in uncleared branch buffers and retained references.
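
The leaf-buffer point above rests on how a ranged Arrays.fill behaves, which can be checked outside Cassandra. The following is a minimal sketch, not Cassandra code: the class FillRangeDemo and its methods clearUpTo and retained are hypothetical names used only for illustration. It shows that a fill over the range [0, 0) touches no slots, so any references already in the array stay reachable.

```java
import java.util.Arrays;

// Hypothetical stand-alone demo; this is NOT BTree.FastBuilder.
class FillRangeDemo {
    /** Clears only the slots in [0, count), like the fill call quoted in the report. */
    static void clearUpTo(Object[] buf, int count) {
        Arrays.fill(buf, 0, count, null);
    }

    /** Counts the slots that still hold a reference. */
    static int retained(Object[] buf) {
        int n = 0;
        for (Object o : buf) {
            if (o != null) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Object[] buf = { "a", "b", "c", null };

        // count == 0 means an empty range: nothing is cleared.
        clearUpTo(buf, 0);
        System.out.println(retained(buf)); // prints 3

        // Clearing the whole array releases every reference.
        Arrays.fill(buf, null);
        System.out.println(retained(buf)); // prints 0
    }
}
```

Whether stale entries can actually be present in the real buffer when count is 0 depends on the builder's internals, which this sketch does not model.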
>  
> This is based on the following observations:
>     Heap Dumps: Analysis of heap dumps shows that leaf().count is often 0, 
> and as a result, the buffer is not being cleared, leading to high heap 
> utilization.
> !image-2024-07-19-08-47-19-517.png!
>     Remote Debugging: Debugging sessions indicate that the drain() method 
> sets count to 0, and the inUse flag for the parent branch is set to false, 
> preventing the while loop in reset() from clearing the branch buffers.
> !image-2024-07-19-08-47-34-582.png!
>  
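
The branch scenario described in the report can be modelled in isolation to make the claimed failure mode concrete. This is a sketch under stated assumptions, not the Cassandra implementation: the Node class, its fields, and guardedReset are hypothetical, and only the loop condition (branch != null && branch.inUse) is taken from the report. It assumes a branch can reach reset() with inUse already false and count already 0 while its buffer still holds entries.

```java
import java.util.Arrays;

// Hypothetical model of a leaf/branch chain; NOT the real BTree.FastBuilder.
class InUseGuardDemo {
    static final class Node {
        final Object[] buffer = new Object[4];
        int count;      // number of slots considered live
        boolean inUse;  // guard checked by the reset loop
        Node parent;
    }

    /** Clears the leaf, then walks up the parents only while inUse is true. */
    static void guardedReset(Node leaf) {
        Arrays.fill(leaf.buffer, 0, leaf.count, null);
        leaf.count = 0;
        Node branch = leaf.parent;
        while (branch != null && branch.inUse) {
            Arrays.fill(branch.buffer, 0, branch.count, null);
            branch.count = 0;
            branch.inUse = false;
            branch = branch.parent;
        }
    }

    /** Counts the slots in a node's buffer that still hold a reference. */
    static int retained(Node n) {
        int r = 0;
        for (Object o : n.buffer) {
            if (o != null) r++;
        }
        return r;
    }

    public static void main(String[] args) {
        Node branch = new Node();
        branch.buffer[0] = "x";
        branch.buffer[1] = "y";
        branch.count = 0;      // assumed state after a drain
        branch.inUse = false;  // assumed state after a drain

        Node leaf = new Node();
        leaf.parent = branch;

        guardedReset(leaf);
        System.out.println(retained(branch)); // prints 2: the loop body never ran
    }
}
```

If the real builder always clears a branch buffer before setting inUse to false, this state cannot arise and the sketch does not apply.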



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
