On 5/30/12 2:38 AM, "Pid" <p...@pidster.com> wrote:
>On 29/05/2012 20:50, Kevin wrote:
>>
>>
>> On 5/29/12 11:09 AM, "Christopher Schultz" <ch...@christopherschultz.net>
>> wrote:
>>
>> Kevin,
>>
>> (FYI these are always a PITA to diagnose...)
>>
>> On 5/29/12 11:03 AM, Kevin wrote:
>>>>> We have a cluster of tomcat servers being used on a very high
>>>>> volume website. We've noticed that for the first 5-6 hours after an
>>>>> application re-load, Full GC will run every 2 minutes, pausing
>>>>> the application for anywhere between 5 and 20 seconds. After 5-6
>>>>> hours the full GC will no longer run until tomcat is reloaded.
>>>>> This behavior is repeatable.
>>>>>
>>>>> We discounted our EhCache settings as a potential issue because
>>>>> tomcat can be killed without this issue showing up. It's not until
>>>>> we actually reload the WAR with changes (clearing the work
>>>>> directory) and restart tomcat that this issue shows up. Traffic
>>>>> level isn't a factor; we'll go through our peak hours with no
>>>>> problem.
>>
>> All good things to know.
>>
>>>>> The servers are all dual quad core with 32GB of RAM running CentOS
>>>>> 5 and tomcat 6.0.18. We've attempted every GC setting change
>>>>> suggested on the net and by developers and system administrators.
>>
>> Did you try looking at what the cause might be, or just blindly apply
>> suggestions to see if something would stick?
>>
>>> Initially we looked closely at it, trying to solve the problem; as time
>>> wore on we tried anything. This has been going on for months now.
>>
>>
>>>>> Our java opts are being played with on a daily basis; however, here
>>>>> are the settings that correspond to the sample GC log below:
>>>>>
>>>>>
>>>>> -server -Xmx27g -Xms27g -XX:+DisableExplicitGC
>>>>> -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution
>>>>> -Dsun.rmi.dgc.client.gcInterval=900000
>>>>> -Dsun.rmi.dgc.server.gcInterval=900000 -XX:NewSize=8g
>>>>> -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps
>>>>> -XX:+PrintGCDetails
>>
>> A new size of 8GiB seems high to me, but I've never run with a total
>> heap size of anything nearly as large as 27GiB, so maybe that's a
>> reasonable size. My opinion here is probably irrelevant.
>>
>>>>> 215.501: [GC 215.504: [ParNew Desired survivor size 238583808
>>>>> bytes, new threshold 15 (max 15) - age 1: 50457968 bytes,
>>>>> 50457968 total : 7456799K->111048K(7922624K), 0.0617110 secs]
>>>>> 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68
>>>>> sys=0.00, real=0.07 secs]
>>
>> Odd that user+sys < real. Hmm.
>>
>> [...]
>>
>>>>> 316.456: [GC 316.459: [ParNew Desired survivor size 238583808
>>>>> bytes, new threshold 15 (max 15) - age 1: 41430416 bytes,
>>>>> 41430416 total - age 3: 22728376 bytes, 64158792 total - age
>>>>> 5: 19599960 bytes, 83758752 total - age 6: 21847616 bytes,
>>>>> 105606368 total - age 7: 27667592 bytes, 133273960 total - age
>>>>> 8: 10904 bytes, 133284864 total - age 9: 31824256 bytes,
>>>>> 165109120 total : 7650333K->215213K(7922624K), 0.1332630 secs]
>>>>> 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45
>>>>> sys=0.01, real=0.14 secs]
>>
>> 1.45 seconds for a full GC isn't anywhere near the 5s-20s claimed
>> above. Have you been able to actually instrument a 5s-20s GC pause?
>> Or are you saying that your webapp appears to stall for that long and
>> those observations coincide with full-GC runs?
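For what it's worth, one way to pin down the real stop-the-world time, rather than inferring it from the stalls, is to have the JVM log every application pause. A minimal sketch, assuming the options go into CATALINA_OPTS via bin/setenv.sh (the gc.log path is just a placeholder):

    # bin/setenv.sh -- sketch only; adjust the log path for your layout
    CATALINA_OPTS="$CATALINA_OPTS \
      -Xloggc:/var/log/tomcat/gc.log \
      -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -XX:+PrintGCApplicationStoppedTime \
      -XX:+PrintGCApplicationConcurrentTime"
    export CATALINA_OPTS

-XX:+PrintGCApplicationStoppedTime writes a line for each pause of the application threads, which should show directly whether they really stop for 5-20 seconds and whether those pauses line up with the Full GC entries.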
>>
>>>>> 343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K),
>>>>> 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm :
>>>>> 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87
>>>>> sys=0.02, real=3.41 secs]
>>
>> Also relatively short.
>>
>>> They fluctuate.
>>
>>
>>>>> **Last Full GC**
>>>>>
>>>>> 20517.892: [GC 20517.898: [ParNew Desired survivor size 238583808
>>>>> bytes, new threshold 15 (max 15) - age 1: 33948208 bytes,
>>>>> 33948208 total - age 2: 88280 bytes, 34036488 total - age
>>>>> 3: 19872472 bytes, 53908960 total - age 4: 16072608 bytes,
>>>>> 69981568 total - age 5: 15718712 bytes, 85700280 total - age
>>>>> 6: 15771016 bytes, 101471296 total - age 7: 16895976 bytes,
>>>>> 118367272 total - age 8: 24233728 bytes, 142601000 total :
>>>>> 7618727K->200950K(7922624K), 0.1728420 secs]
>>>>> 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21
>>>>> sys=0.01, real=0.18 secs]
>>
>> Note that the above is not a "Full GC". This one /is/:
>>
>>>>> 20526.469: [Full GC 20526.475: [CMS:
>>>>> 9175755K->9210800K(19922944K), 33.1161300 secs]
>>>>> 13632232K->9210800K(27845568K), [CMS Perm :
>>>>> 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12
>>>>> sys=0.02, real=33.12 secs]
>>
>> Okay, that one's a doozy.
>>
>> It's weird that it looks like after the full GC, the heap has *lost*
>> some space ("CMS: 9175755K->9210800K"). Can you watch the logs and
>> take a heap dump (which may take a LONG time with such a big heap, so
>> be prepared for another long pause) after one of these full GCs?
>>
>> It would be better to get a heap dump both before AND after the full
>> GC, but that can ruin your performance if you enable those automated
>> GC options.
>>
>>> We can run this after our next code release in two days.
>>
>>
>>>>> Any help would be greatly appreciated.
>>
>> Are you populating huge caches on startup or anything like that?
>> Generational GC performance is related to the number of surviving
>> objects that have to be moved from one generation to the next, so if
>> you have a huge population of objects surviving each collection one
>> after the other (which is what I might expect if you had an enormous
>> cache being filled all at once) then your performance will suck for a
>> while until those objects are tenured enough to avoid periodic
>> collections.
>>
>>> This was our initial assumption; however, since we're using an in-memory
>>> EhCache we would expect this issue to re-occur when tomcat is restarted.
>>> It doesn't. It'll only show up if we stop tomcat, reload the WAR and
>>> clean the work directory.
>
>Which version of EhCache? How large is the cache, and are you using RMI
>replication or disk overflow?

We're not using replication or disk overflow.

>
>p
>
>
>> Are you able to inspect the various generations before and after these
>> pauses? I'd be interested to see if you have a huge population of
>> objects being tenured at once and what types of objects they are (they
>> might be something obvious like instances of our.big.fat.CacheEntry or
>> something like that).
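On the inspection question: the stock JDK tools should be enough to see what is sitting in the tenured generation around one of these collections. A rough sketch (<pid>, the interval and the output file names are placeholders; note that the :live options below trigger a full collection of their own, so expect another long pause on a 27g heap):

    # CMS/tenured occupancy and GC counters, sampled every 5 seconds
    jstat -gcutil <pid> 5000

    # class histogram of live objects -- shows what is actually surviving
    jmap -histo:live <pid> > histo-after-fullgc.txt

    # full heap dump for offline analysis (Eclipse MAT, jhat, etc.)
    jmap -dump:live,format=b,file=heap-after-fullgc.hprof <pid>

If a histogram taken right after a Full GC is dominated by a single cache-entry class, that would point back at the cache-population theory above.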
>>
>>> We'll give it a try
>>
>>> Thanks
>>
>>
>> -chris