Hello Justin,

We were thinking the same thing with the JRuby workers: perhaps lowering them back to 4 and the heap back to 3 GB, which worked fine before, now that we have added two more Puppet servers.
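Concretely, that would just mean dialing two knobs back on each server. A rough sketch, assuming the stock puppetserver package layout (the values are only what we would try first):

    # /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf
    jruby-puppet: {
        # existing ruby-load-path / gem-home settings stay as they are
        # drop from 6 back to 4 JRuby workers per server
        max-active-instances: 4
    }

    # /etc/sysconfig/puppetserver (or /etc/default/puppetserver on Debian-family):
    # heap from 4 GB back to 3 GB, keeping whatever other flags are already set
    JAVA_ARGS="-Xms3g -Xmx3g"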
The behavior we see is failing Puppet runs like this on random modules:

    Could not evaluate: Could not retrieve file metadata for
    puppet:///modules/modulename/resourcename: SSL_connect returned=6
    errno=0 state=unknown state

Our guess is that something took far too long to answer. Reports are fine, PuppetDB is fine; things always make it there, and we see the failures. It is likely we have a thundering herd that comes in and sometimes makes the situation worse, but if that were the whole story the failures would come every 30 minutes, and they don't.

I can't think of any other server settings we are managing besides the JRuby instances, the heap size, and a tmp dir for Java to work with (/tmp is noexec here). We are on this JVM, with no custom tuning:

    openjdk version "1.8.0_191"
    OpenJDK Runtime Environment (build 1.8.0_191-b12)
    OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

We will try adjusting the JRuby/heap ratio now that we have more Puppetservers; we consistently see all JRuby instances being utilized even when set to 6.

Another thing we may consider is running Puppet every 45 minutes instead of every 30, which will lower load as well.
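If we do stretch the interval, that is only an agent-side puppet.conf change, roughly like this (splay is just an idea to spread the check-ins, not something we set today):

    # puppet.conf on the agents
    [agent]
      runinterval = 45m
      # optional: randomize each agent's start time within the interval
      # to take the edge off the herd
      splay = true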
Thanks for your thoughts,
Mike

On Monday, February 11, 2019 at 5:21:06 PM UTC-6, Justin Stoller wrote:
>
> On Mon, Feb 11, 2019 at 5:42 AM Mike Sharpton <shar...@gmail.com> wrote:
>
>> Hey all,
>>
>> We have recently upgraded our environment from Puppetserver 4.2.2 to
>> Puppetserver 6.0.2. We are running a mix of Puppet 4 and Puppet 6 agents
>> until we can get them all upgraded to 6. We have around 6000 nodes, and we
>> had 4 Puppetservers, but we added two more due to capacity issues with
>> Puppet 6. The load is MUCH higher with Puppet 6. To the question, I am
>> seeing longer and longer agent run times after about two days of the
>> services running. The only error in the logs that seems to have any
>> relation to this is this string:
>>
>> 2019-02-11T04:32:28.409-06:00 ERROR [qtp1148783071-4075] [p.r.core]
>> Internal Server Error: java.io.IOException:
>> java.util.concurrent.TimeoutException: Idle timeout expired: 30001/30000 ms
>>
>> After I restart the puppetserver service, this goes away for about two
>> days. I think Puppetserver is dying a slow death under this load (load
>> average of around 5-6). We are running Puppetserver on vm's that are
>> 10X8GB and using 6 JRuby workers per Puppetserver and a 4GB heap. I have
>> not seen any OOM exceptions and the process never crashes. Has anyone else
>> seen anything like this? I did some Googling and didn't find a ton of
>> relevant stuff. Perhaps we need to upgrade to the latest version to see if
>> this helps? Even more capacity? Seems silly. Thanks in advance!
>
> Off the top of my head:
> 1. Have you tried lowering the JRuby workers to JVM heap ratio? (I would
> try 1G to 1 worker to see if it really is worker performance.)
> 2. That error is most likely from Jetty (it can be tuned with
> idle-timeout-milliseconds [1]). Are agent runs failing with a 500 from the
> server when that happens? Are clients failing to post their facts or
> reports in a timely manner? Is Puppet Server failing its connections to
> PuppetDB?
> 3. Are you managing any other server settings? Having a low
> max-requests-per-instance is problematic for newer servers (they more
> aggressively compile/optimize the Ruby code the worker loads, so with
> shorter lifetimes it does a bunch of work to then throw it away and start
> over - and that can cause much more load).
> 4. What version of Java are you using / do you have any custom tuning of
> Java that maybe doesn't work well with newer servers? Server 5+ only has
> support for Java 8 and will use more non-heap memory/code cache for those
> new optimizations mentioned above.
>
> HTH,
> Justin
>
> 1. https://github.com/puppetlabs/trapperkeeper-webserver-jetty9/blob/master/doc/jetty-config.md#idle-timeout-milliseconds
>
>> Mike
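P.S. For anyone following this thread, the idle-timeout Justin points to in [1] lives in the webserver section of the Jetty config. A sketch, assuming the default conf.d layout and an example value of 60 seconds (tune to taste):

    # /etc/puppetlabs/puppetserver/conf.d/webserver.conf
    webserver: {
        # existing ssl-host / ssl-port settings stay as they are
        # raise Jetty's idle timeout above the 30 s default behind the
        # "Idle timeout expired: 30001/30000 ms" errors
        idle-timeout-milliseconds: 60000
    }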