[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243765#comment-16243765
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10136:
---------------------------------------------

rhtyd opened a new pull request #2314: CLOUDSTACK-10136: Fix RemoteHostEndPoint thread growth
URL: https://github.com/apache/cloudstack/pull/2314
 
 
   This fixes the following:
   - Unchecked thread growth in RemoteHostEndPoint
   - Potential NPE while finding the EP (endpoint) for a storage/scope
   
   Unbounded thread growth can be reproduced, with the following findings:
   - Every unreachable template URL produces 6 new threads (in a single
   ScheduledExecutorService instance), spaced 10 seconds apart.
   - Every reachable template URL where the template does not exist produces
   1 new thread (and one ScheduledExecutorService instance); it errors out
   quickly without causing more thread growth.
   - Every valid URL produces up to 10 threads, as the same EP (endpoint
   instance) is reused to query upload/download (async callback) progress.
   
   Every RemoteHostEndPoint instance creates its own
   ScheduledExecutorService instance, which is why the jstack dump shows
   several threads sharing the prefix RemoteHostEndPoint-{1..10}
   (the pool size is defined as 10, so it uses suffixes 1-10).
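
   The growth pattern described above can be illustrated with a minimal, self-contained sketch. It is not the CloudStack code: `LeakyEndPoint`, its `send` method, and the `LeakyEndPoint-` thread prefix are hypothetical stand-ins, and the 10 ms delay is chosen only to keep the demo short. It assumes each instance builds its own scheduled pool of size 10, as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PerInstancePoolDemo {
    static final String PREFIX = "LeakyEndPoint-";

    // Hypothetical stand-in for an endpoint that owns its own scheduler.
    static class LeakyEndPoint {
        private final AtomicInteger seq = new AtomicInteger();
        final ScheduledExecutorService executor = Executors.newScheduledThreadPool(10, r -> {
            Thread t = new Thread(r, PREFIX + seq.incrementAndGet());
            t.setDaemon(true); // lets the demo JVM exit
            return t;
        });

        void send(Runnable task) {
            executor.schedule(task, 10, TimeUnit.MILLISECONDS);
        }
    }

    // Counts live threads in this JVM whose name starts with the prefix.
    static long liveThreads(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    // Creates `endpoints` instances, runs one short task on each, and returns
    // how many pool threads are still alive after every task has finished.
    static long lingeringThreads(int endpoints) throws InterruptedException {
        List<LeakyEndPoint> eps = new ArrayList<>();
        CountDownLatch done = new CountDownLatch(endpoints);
        for (int i = 0; i < endpoints; i++) {
            LeakyEndPoint ep = new LeakyEndPoint();
            eps.add(ep);
            ep.send(done::countDown);
        }
        done.await();
        long alive = liveThreads(PREFIX);
        // Cleanup for the demo only; the leak occurs when nothing does this.
        for (LeakyEndPoint ep : eps) {
            ep.executor.shutdownNow();
            ep.executor.awaitTermination(1, TimeUnit.SECONDS);
        }
        return alive;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("lingering threads: " + lingeringThreads(5));
    }
}
```

   Core threads of a scheduled pool do not time out by default, so each instance keeps at least one idle worker alive until the pool is explicitly shut down.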
   
   This fixes the discovered thread leak, with the following notes:
   - Instead of a ScheduledExecutorService instance per endpoint, a cached
   thread pool is now used, with `static` scope so that it is shared by
   all RemoteHostEndPoint instances.
   - It was not clear why we would want to wait once Answers have been
   returned from the remote EP, so a scheduled/delayed Runnable is not
   needed for processing answers. ScheduledExecutorService was therefore
   replaced with a plain ExecutorService.
   - Another benefit of a cached pool is that it shuts down threads that
   have been idle for 60 seconds, and reuses live threads for future
   Runnable submissions.
   - Caveat: the executor service is still unbounded. However, this is
   acceptable here because the method is used for short jobs that check
   upload/download progress.
   - Refactored CmdRunner to not use/reference objects from the parent class.
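
   The approach in the notes above can be sketched as follows. This is an illustration under stated assumptions, not the code in the pull request: `EndPoint`, `send`, and the `SharedEndPoint-` prefix are hypothetical. The pool is built with the same parameters that `Executors.newCachedThreadPool()` uses, except that the keep-alive is shortened from 60 seconds to 200 ms so the idle shutdown can be observed quickly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedPoolDemo {
    private static final AtomicInteger SEQ = new AtomicInteger();

    private static final ThreadFactory FACTORY = r -> {
        Thread t = new Thread(r, "SharedEndPoint-" + SEQ.incrementAndGet());
        t.setDaemon(true); // lets the demo JVM exit
        return t;
    };

    // One pool for every endpoint: 0 core threads, unbounded maximum,
    // SynchronousQueue hand-off, idle threads reclaimed after the keep-alive.
    static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
            0, Integer.MAX_VALUE, 200, TimeUnit.MILLISECONDS,
            new SynchronousQueue<>(), FACTORY);

    // Hypothetical endpoint: it holds no executor of its own.
    static class EndPoint {
        Future<?> send(Runnable task) {
            return POOL.submit(task); // runs immediately, no scheduling delay
        }
    }

    // Runs one short task per endpoint, then waits (up to 5 s) for the idle
    // workers to be reclaimed and returns the remaining pool size.
    static int poolSizeAfterIdle(int endpoints) throws Exception {
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < endpoints; i++) {
            futures.add(new EndPoint().send(() -> { }));
        }
        for (Future<?> f : futures) {
            f.get();
        }
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (POOL.getPoolSize() > 0 && System.nanoTime() < deadline) {
            Thread.sleep(50);
        }
        return POOL.getPoolSize();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pool size after idle: " + poolSizeAfterIdle(20));
    }
}
```

   Because the maximum pool size is unbounded, a burst of long-running tasks would still create one thread per task; the design only suits short jobs, as the caveat above notes.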
   
   Screenshot showing deterministic thread growth for a template with an invalid/unreachable URL:
   ![screenshot from 2017-11-08 13-40-59](https://user-images.githubusercontent.com/95203/32542409-893496a6-c498-11e7-8afb-1b2e1a46e710.png)
   
   Screenshot showing threads transitioning from waiting->stopped (and re-use) with this fix:
   ![screenshot from 2017-11-08 14-49-10](https://user-images.githubusercontent.com/95203/32542430-996e0638-c498-11e7-89d9-432b2d0afa89.png)
   
   To verify, the following can be tried:
   - Before applying this fix, in a test environment, register two templates such that (1) one has a reachable IP/domain but the resource does not exist (causing a 404) and (2) the other uses a domain/IP that is not reachable at all.
   - Thread growth can be checked using `jstack -l <mgmt server PID> | grep RemoteHostEndPoint`, or using a visual tool such as VisualVM.
   - With the fix applied and the mgmt server restarted, the server will reattempt to download those templates, but no large thread growth should be seen. After about 2-4 minutes all the threads should shut down, and `jstack -l <mgmt server PID> | grep RemoteHostEndPoint` will show no threads.
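
   Where attaching `jstack` is inconvenient, for example inside a unit test, a similar check can be made from within the JVM. The sketch below is an assumption-labelled illustration, not part of the pull request: `ThreadPrefixCount` and its helper methods are hypothetical names, and it only sees threads of the JVM it runs in.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

public class ThreadPrefixCount {
    // Groups live threads whose name starts with `prefix` by thread state.
    static Map<Thread.State, Integer> countByState(String prefix) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(prefix)) {
                counts.merge(t.getState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    static int total(Map<Thread.State, Integer> counts) {
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    // Demo helper: starts `n` daemon threads named prefix+1..n that block on
    // `release`, and returns once all of them are running.
    static void startBlockedThreads(String prefix, int n, CountDownLatch release)
            throws InterruptedException {
        CountDownLatch started = new CountDownLatch(n);
        for (int i = 1; i <= n; i++) {
            Thread t = new Thread(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, prefix + i);
            t.setDaemon(true);
            t.start();
        }
        started.await();
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        startBlockedThreads("RemoteHostEndPoint-", 3, release);
        System.out.println(countByState("RemoteHostEndPoint-"));
        release.countDown();
    }
}
```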
   
   Pinging for review - @DaanHoogland @nvazquez @borisstoyanov @PaulAngus @wido @mlsorensen @marcaurele and others
   
   @blueorangutan package

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix thread growth/leak issue
> ----------------------------
>
>                 Key: CLOUDSTACK-10136
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10136
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>    Affects Versions: 4.5.2, 4.6.2, 4.7.1, 4.10.0.0, 4.9.2.0, 4.8.1.1, 4.9.3.0
>            Reporter: Rohit Yadav
>            Assignee: Rohit Yadav
>             Fix For: 4.11.0.0
>
>
> For a long-running mgmt server with a large number of templates etc., a large
> number of waiting threads are seen whose names start with the
> 'RemoteHostEndPoint-' prefix. These async threads are mostly responsible for
> checking template/volume upload/download progress/states. They kick in every
> time a template is being checked, downloaded, set up, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
