Kevin,
(FYI these are always a PITA to diagnose...)

On 5/29/12 11:03 AM, Kevin wrote:
> We have a cluster of tomcat servers being used on a very high
> volume website. We've noticed that for the first 5-6 hours after an
> application re-load, Full GC will run every 2 minutes, pausing
> the application for anywhere between 5 and 20 seconds. After 5-6
> hours the full GC will no longer run until tomcat is reloaded.
> This behavior is repeatable.
>
> We discounted our EhCache settings as a potential issue because
> tomcat can be killed without this issue showing up. It's not until
> we actually reload the WAR with changes (clearing the work
> directory) and restart tomcat that this issue shows up. Traffic
> level isn't a factor; we'll go through our peak hours with no
> problem.

All good things to know.

> The servers are all dual quad core with 32GB of RAM running Centos
> 5 and tomcat 6.0.18. We've attempted every GC setting change
> suggested on the net and by developers and system administrators.

Did you try looking at what the cause might be, or did you just
blindly apply suggestions to see if something would stick?

> Our java opts are being played with on a daily basis, however here
> are the settings that correspond to the sample GC log below:
>
> -server -Xmx27g -Xms27g -XX:+DisableExplicitGC
> -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution
> -Dsun.rmi.dgc.client.gcInterval=900000
> -Dsun.rmi.dgc.server.gcInterval=900000 -XX:NewSize=8g
> -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps
> -XX:+PrintGCDetails

A new size of 8GiB seems high to me, but I've never run with a total
heap size of anything nearly as large as 27GiB, so maybe that's a
reasonable size. My opinion here is probably irrelevant.

> 215.501: [GC 215.504: [ParNew
> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
> - age   1:   50457968 bytes,   50457968 total
> : 7456799K->111048K(7922624K), 0.0617110 secs]
> 8614906K->1269155K(27845568K), 0.0661400 secs]
> [Times: user=0.68 sys=0.00, real=0.07 secs]

(Aside: user is much higher than real there, but that's just ParNew's
parallel GC threads adding their CPU time together.)

[...]

> 316.456: [GC 316.459: [ParNew
> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
> - age   1:   41430416 bytes,   41430416 total
> - age   3:   22728376 bytes,   64158792 total
> - age   5:   19599960 bytes,   83758752 total
> - age   6:   21847616 bytes,  105606368 total
> - age   7:   27667592 bytes,  133273960 total
> - age   8:      10904 bytes,  133284864 total
> - age   9:   31824256 bytes,  165109120 total
> : 7650333K->215213K(7922624K), 0.1332630 secs]
> 8808440K->1373320K(27845568K), 0.1380590 secs]
> [Times: user=1.45 sys=0.01, real=0.14 secs]

1.45 seconds of user time for that collection isn't anywhere near the
5s-20s claimed above. Have you been able to actually instrument a
5s-20s GC pause? Or are you saying that your webapp appears to stall
for that long, and those observations coincide with full-GC runs?

> 343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K),
> 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm :
> 83964K->47203K(83968K)], 3.4168600 secs]
> [Times: user=3.87 sys=0.02, real=3.41 secs]

Also relatively short.
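One quick sanity check on the flags, assuming I'm remembering the
SurvivorRatio semantics and the TargetSurvivorRatio default (50%)
correctly -- this is just back-of-the-envelope arithmetic, not
anything measured:

    each survivor space = NewSize / (SurvivorRatio + 2)
                        = 8GiB / 18          ~= 477,218,588 bytes
    desired survivor    = 50% of a survivor  ~= 238,609,294 bytes

...which (modulo alignment) matches the "Desired survivor size
238583808 bytes" in your log, so the young generation is at least
sized the way the flags intend.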
> **Last Full GC**
>
> 20517.892: [GC 20517.898: [ParNew
> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
> - age   1:   33948208 bytes,   33948208 total
> - age   2:      88280 bytes,   34036488 total
> - age   3:   19872472 bytes,   53908960 total
> - age   4:   16072608 bytes,   69981568 total
> - age   5:   15718712 bytes,   85700280 total
> - age   6:   15771016 bytes,  101471296 total
> - age   7:   16895976 bytes,  118367272 total
> - age   8:   24233728 bytes,  142601000 total
> : 7618727K->200950K(7922624K), 0.1728420 secs]
> 16794482K->9376705K(27845568K), 0.1822350 secs]
> [Times: user=2.21 sys=0.01, real=0.18 secs]

Note that the above is not a "Full GC". This one /is/:

> 20526.469: [Full GC 20526.475: [CMS:
> 9175755K->9210800K(19922944K), 33.1161300 secs]
> 13632232K->9210800K(27845568K), [CMS Perm :
> 83967K->53332K(83968K)], 33.1254170 secs]
> [Times: user=33.12 sys=0.02, real=33.12 secs]

Okay, that one's a doozy. It's also weird that the old generation's
occupancy actually *grew* across the full GC ("CMS:
9175755K->9210800K") -- you'd expect a full collection to free
something.

Can you watch the logs and take a heap dump (which may take a LONG
time with such a big heap, so be prepared for another long pause)
after one of these full GCs? It would be better to get a heap dump
both before AND after the full GC, but that can ruin your performance
if you enable those automated GC options. (The P.S. below sketches
one way to trigger a dump.)

> Any help would be greatly appreciated.

Are you populating huge caches on startup or anything like that?
Generational GC performance is related to the number of surviving
objects that have to be moved from one generation to the next, so if
you have a huge population of objects surviving each collection one
after the other (which is what I might expect if you had an enormous
cache being filled all at once), then your performance will suck for
a while until those objects are tenured enough to avoid periodic
collections. (The P.P.S. below sketches that scenario.)

Are you able to inspect the various generations before and after
these pauses? I'd be interested to see if you have a huge population
of objects being tenured at once, and what types of objects they are
(they might be something obvious like instances of
our.big.fat.CacheEntry or something like that).

-chris
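P.S. Here's roughly what I mean by taking a dump right after one of
these full GCs. This is only a sketch: it has to run inside the
Tomcat JVM to be useful (a throwaway JSP or servlet would do), the
class name and file path are made up, and jmap
(-dump:live,format=b,file=... <pid>) from the command line gets you
the same thing if you'd rather not add code:

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class HeapDumper {
        // Writes an .hprof snapshot of the JVM this code runs in.
        // live=true walks only reachable objects (and forces a GC
        // first), which is what you want when chasing a cache/leak.
        public static void dump(String path, boolean live) throws Exception {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(path, live); // blocks until the file is written
        }

        public static void main(String[] args) throws Exception {
            dump("/tmp/tomcat-after-fullgc.hprof", true); // path is made up
        }
    }

With a 27GiB heap, expect the dump itself to pause the JVM for a good
while and to produce a file of comparable size, so make sure the
target filesystem has the space.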
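P.P.S. To make the "huge cache filled at startup" theory concrete,
something shaped like the sketch below (all the names are invented)
keeps a few million objects reachable across every young collection.
ParNew then has to copy them between the survivor spaces on every
pass until they reach the tenuring threshold and get promoted into
the CMS old generation -- which is the pattern your tenuring
distributions show:

    import java.util.HashMap;
    import java.util.Map;

    public class StartupCacheFill {
        // Invented example; needs roughly -Xmx1g to run as-is.
        static final Map<String, byte[]> CACHE =
            new HashMap<String, byte[]>();

        public static void main(String[] args) {
            // A burst of allocations at (re)deploy time. Every entry
            // stays reachable, so each minor GC copies the whole lot
            // until age >= MaxTenuringThreshold (15 in your logs),
            // after which it all lands in the old generation.
            for (int i = 0; i < 2000000; i++) {
                CACHE.put("key-" + i, new byte[128]);
            }
            System.out.println("cached " + CACHE.size() + " entries");
        }
    }

If that's what your startup looks like, it would also explain why
things settle down after a few hours once everything has tenured.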