[ https://issues.apache.org/jira/browse/CLOUDSTACK-10136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243765#comment-16243765 ]
ASF GitHub Bot commented on CLOUDSTACK-10136:
---------------------------------------------

rhtyd opened a new pull request #2314: CLOUDSTACK-10136: Fix RemoteHostEndPoint thread growth
URL: https://github.com/apache/cloudstack/pull/2314

This fixes the following:
- Unchecked thread growth in RemoteHostEndPoint
- Potential NPE while finding EP for a storage/scope

Unbounded thread growth can be reproduced, with the following findings:
- Every unreachable template produces 6 new threads (in a single ScheduledExecutorService instance), spaced 10 seconds apart.
- Every reachable template URL where the template does not exist produces 1 new thread (and one ScheduledExecutorService instance); it errors out quickly without causing more thread growth.
- Every valid URL produces up to 10 threads, as the same ep (endpoint instance) is reused to query upload/download (async callback) progress.

Every RemoteHostEndPoint instance creates its own ScheduledExecutorService instance, which is why the jstack dump shows several threads that share the prefix RemoteHostEndPoint-{1..10} (the pool size is defined as 10, so it uses suffixes 1-10).

This fixes the discovered thread leakage, with the following notes:
- Instead of a ScheduledExecutorService instance per endpoint, a cached pool is now used, with `static` scope so that it is reused by future RemoteHostEndPoint instances.
- It was not clear why we would want to wait when we already have Answers returned from the remote EP, so a scheduled/delayed Runnable was not required at all for processing answers. ScheduledExecutorService was therefore not really required, and the code moved to ExecutorService instead.
- Another benefit of using a cached pool is that it shuts down threads that have not been used for 60 seconds, and idle threads get reused for future runnable submissions.
- Caveat: the executor service is still unbounded; however, this method is used for short jobs that check upload/download progress, so an unbounded pool fits the use case here.
- Refactored CmdRunner to not use/reference objects from the parent class.

[Screenshots: deterministic thread growth for a template with an invalid/unreachable URL]

[Screenshot: threads transitioning from waiting->stopped (and being reused) with this fix]

To verify, the following can be tried:
- Before applying this fix, in a test environment, register two templates such that (1) one has a reachable IP/domain but the resource does not exist (causing a 404) and (2) the other uses a domain/IP that is not reachable at all.
- Thread growth can be checked using `jstack -l <mgmt server PID> | grep RemoteHostEndPoint`, or using a visual tool such as VisualVM.
- With the fix applied and the mgmt server restarted, the mgmt server will reattempt to download those templates; no large thread growth should be seen, and after about 2-4 minutes all the threads should shut down, so `jstack -l <mgmt server PID> | grep RemoteHostEndPoint` will show no threads.

Pinging for review - @DaanHoogland @nvazquez @borisstoyanov @PaulAngus @wido @mlsorensen @marcaurele and others

@blueorangutan package

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Fix thread growth/leak issue
> ----------------------------
>
> Key: CLOUDSTACK-10136
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10136
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Affects Versions: 4.5.2, 4.6.2, 4.7.1, 4.10.0.0, 4.9.2.0, 4.8.1.1, 4.9.3.0
> Reporter: Rohit Yadav
> Assignee: Rohit Yadav
> Fix For: 4.11.0.0
>
>
> For a long-running mgmt server with a large number of templates etc., large
> numbers of waiting threads are seen whose names start with the
> 'RemoteHostEndPoint-' prefix. These async threads are mostly responsible for
> checking template/volume upload/download progress/states. They kick in every
> time a template is being checked, downloaded, set up, etc.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
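
The pooling change described in the pull request above can be illustrated with a minimal Java sketch. This is not the CloudStack code: `PerInstanceEndpoint`, `SharedPoolEndpoint`, `PoolGrowthDemo`, the 10 ms delay, and the daemon thread factory are all hypothetical names and choices made for illustration. It only contrasts one scheduled pool per endpoint (whose core threads linger) with a single `static` cached pool shared by all endpoints.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Pre-fix pattern (illustrative only): every endpoint owns its own scheduled pool. */
final class PerInstanceEndpoint {
    // Pool size 10 as in the PR description; core threads of this pool never time out.
    final ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(10);

    void processAnswer(Runnable job) {
        // The 10 ms delay is an arbitrary stand-in for the scheduled/delayed Runnable.
        executor.schedule(job, 10, TimeUnit.MILLISECONDS);
    }
}

/** Post-fix pattern (illustrative only): all endpoints share one static cached pool. */
final class SharedPoolEndpoint {
    private static final AtomicInteger SEQ = new AtomicInteger();

    // A cached pool reuses idle threads and reclaims them after 60 seconds of idleness.
    // Daemon threads are an assumption of this demo so that the JVM can exit promptly.
    static final ThreadPoolExecutor EXECUTOR =
            (ThreadPoolExecutor) Executors.newCachedThreadPool(r -> {
                Thread t = new Thread(r, "RemoteHostEndPoint-" + SEQ.incrementAndGet());
                t.setDaemon(true);
                return t;
            });

    void processAnswer(Runnable job) {
        EXECUTOR.submit(job); // runs as soon as a thread is available, no delay
    }
}

final class PoolGrowthDemo {
    /** Runs one job on each of N per-instance endpoints and counts threads left behind. */
    static int lingeringThreadsPerInstance(int endpoints) {
        List<PerInstanceEndpoint> eps = new ArrayList<>();
        CountDownLatch done = new CountDownLatch(endpoints);
        for (int i = 0; i < endpoints; i++) {
            PerInstanceEndpoint ep = new PerInstanceEndpoint();
            eps.add(ep);
            ep.processAnswer(done::countDown);
        }
        awaitAll(done);
        int total = 0;
        for (PerInstanceEndpoint ep : eps) {
            total += ep.executor.getPoolSize(); // worker threads still parked in this pool
            ep.executor.shutdownNow();          // cleanup for the demo only
        }
        return total;
    }

    /** Runs one job on each of N shared-pool endpoints and reports the pool's peak size. */
    static int peakThreadsShared(int endpoints) {
        CountDownLatch done = new CountDownLatch(endpoints);
        for (int i = 0; i < endpoints; i++) {
            new SharedPoolEndpoint().processAnswer(done::countDown);
        }
        awaitAll(done);
        return SharedPoolEndpoint.EXECUTOR.getLargestPoolSize();
    }

    private static void awaitAll(CountDownLatch latch) {
        try {
            if (!latch.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("jobs did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("per-instance pools, lingering threads: " + lingeringThreadsPerInstance(5));
        System.out.println("shared cached pool, peak threads: " + peakThreadsShared(5));
    }
}
```

In the per-instance variant, each endpoint leaves at least one parked thread behind until its pool is explicitly shut down, so the count grows with the number of endpoints. In the shared variant, the peak can never exceed the number of concurrently running jobs, and idle threads are released after the cached pool's 60-second keep-alive.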