On 5/30/12 2:38 AM, "Pid" <p...@pidster.com> wrote:

>On 29/05/2012 20:50, Kevin wrote:
>> 
>> 
>> On 5/29/12 11:09 AM, "Christopher Schultz"
>><ch...@christopherschultz.net>
>> wrote:
>> 
>> Kevin,
>> 
>> (FYI these are always a PITA to diagnose...)
>> 
>> On 5/29/12 11:03 AM, Kevin wrote:
>>>>> We have a cluster of Tomcat servers being used on a very high
>>>>> volume website. We've noticed that for the first 5-6 hours after an
>>>>> application reload, Full GC runs every 2 minutes, pausing the
>>>>> application for anywhere between 5 and 20 seconds. After 5-6 hours
>>>>> the Full GC no longer runs until Tomcat is reloaded. This behavior
>>>>> is repeatable.
>>>>>
>>>>> We discounted our EhCache settings as a potential issue because
>>>>> Tomcat can be killed without this issue showing up.  It's not until
>>>>> we actually reload the WAR with changes (clearing the work
>>>>> directory) and restart Tomcat that this issue shows up.  Traffic
>>>>> level isn't a factor; we go through our peak hours with no
>>>>> problem.
>> 
>> All good things to know.
>> 
>>>>> The servers are all dual quad core with 32GB of RAM running CentOS
>>>>> 5 and Tomcat 6.0.18.  We've attempted every GC setting change
>>>>> suggested on the net and by developers and system administrators.
>> 
>> Did you try looking at what the cause might be, or just blindly apply
>> suggestions to see if something would stick?
>> 
>>> Initially we looked closely at it, trying to solve the problem; as
>>> time wore on we tried anything.  This has been going on for months
>>> now.
>> 
>> 
>> 
>>>>> Our java opts are being played with on a daily basis however here
>>>>> are the settings that correspond to the sample GC log below:
>>>>>
>>>>>
>>>>> -server -Xmx27g -Xms27g  -XX:+DisableExplicitGC
>>>>> -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution
>>>>> -Dsun.rmi.dgc.client.gcInterval=900000
>>>>> -Dsun.rmi.dgc.server.gcInterval=900000 -XX:NewSize=8g
>>>>> -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps
>>>>> -XX:+PrintGCDetails
>> 
>> A new size of 8GiB seems high to me, but I've never run with a total
>> heap size of anything nearly as large as 27GiB, so maybe that's a
>> reasonable size. My opinion here is probably irrelevant.
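>> 
>> Since you're already printing GC details, it might also be worth
>> (assuming your JVM build supports these flags) sending the log to a
>> file with wall-clock timestamps, e.g.
>> 
>>   -Xloggc:/path/to/gc.log -XX:+PrintGCDateStamps
>> 
>> so the pauses can be matched against application-level symptoms and
>> analyzed offline. The path here is just an example.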
>> 
>>>>> 215.501: [GC 215.504: [ParNew Desired survivor size 238583808
>>>>> bytes, new threshold 15 (max 15) - age   1:   50457968 bytes,
>>>>> 50457968 total : 7456799K->111048K(7922624K), 0.0617110 secs]
>>>>> 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68
>>>>> sys=0.00, real=0.07 secs]
>> 
>> Note that user+sys is quite a bit larger than real there. Hmm.
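>> 
>> If the dual quad core boxes give ParNew 8 parallel GC threads (an
>> assumption on my part), the numbers roughly work out:
>> 
>>   0.68s user / 8 threads ~= 0.085s of wall clock, close to real=0.07s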
>> 
>> [...]
>> 
>>>>> 316.456: [GC 316.459: [ParNew Desired survivor size 238583808
>>>>> bytes, new threshold 15 (max 15) - age   1:   41430416 bytes,
>>>>> 41430416 total - age   3:   22728376 bytes,   64158792 total - age
>>>>> 5:   19599960 bytes,   83758752 total - age   6:   21847616 bytes,
>>>>> 105606368 total - age   7:   27667592 bytes,  133273960 total - age
>>>>> 8:      10904 bytes,  133284864 total - age   9:   31824256 bytes,
>>>>> 165109120 total : 7650333K->215213K(7922624K), 0.1332630 secs]
>>>>> 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45
>>>>> sys=0.01, real=0.14 secs]
>> 
>> 1.45 seconds of user time (0.14s real) for that young-generation
>> collection isn't anywhere near the 5s-20s claimed above. Have you
>> been able to actually instrument a 5s-20s GC pause? Or are you saying
>> that your webapp appears to stall for that long and those
>> observations coincide with full-GC runs?
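>> 
>> One way to check, assuming your JVM supports the flag, is to run for
>> a while with
>> 
>>   -XX:+PrintGCApplicationStoppedTime
>> 
>> Every stop-the-world pause then produces a line like "Total time for
>> which application threads were stopped: ..." that you can line up
>> against the stalls you're seeing.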
>> 
>>>>> 343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K),
>>>>> 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm :
>>>>> 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87
>>>>> sys=0.02, real=3.41 secs]
>> 
>> Also relatively short.
>> 
>>> They fluctuate.
>> 
>> 
>>>>> **Last Full GC**
>>>>>
>>>>> 20517.892: [GC 20517.898: [ParNew Desired survivor size 238583808
>>>>> bytes, new threshold 15 (max 15) - age   1:   33948208 bytes,
>>>>> 33948208 total - age   2:      88280 bytes,   34036488 total - age
>>>>> 3:   19872472 bytes,   53908960 total - age   4:   16072608 bytes,
>>>>> 69981568 total - age   5:   15718712 bytes,   85700280 total - age
>>>>> 6:   15771016 bytes,  101471296 total - age   7:   16895976 bytes,
>>>>> 118367272 total - age   8:   24233728 bytes,  142601000 total :
>>>>> 7618727K->200950K(7922624K), 0.1728420 secs]
>>>>> 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21
>>>>> sys=0.01, real=0.18 secs]
>> 
>> Note that the above is not a "Full GC". This one /is/:
>> 
>>>>> 20526.469: [Full GC 20526.475: [CMS:
>>>>> 9175755K->9210800K(19922944K), 33.1161300 secs]
>>>>> 13632232K->9210800K(27845568K), [CMS Perm :
>>>>> 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12
>>>>> sys=0.02, real=33.12 secs]
>> 
>> Okay, that one's a doozy.
>> 
>> It's weird that it looks like after the full GC, the heap has *lost*
>> some free space ("CMS: 9175755K->9210800K" - usage actually went up).
>> Can you watch the logs and take a heap dump (which may take a LONG
>> time with such a big heap, so be prepared for another long pause)
>> after one of these full GCs?
>> 
>> It would be better to get a heap dump both before AND after the full
>> GC, but that can ruin your performance if you enable those automated
>> GC options.
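>> 
>> For reference, a one-off dump can be taken with jmap, e.g.
>> 
>>   jmap -dump:format=b,file=/tmp/after-fullgc.hprof <pid>
>> 
>> (the file name is just an example), and the automated options I have
>> in mind are -XX:+HeapDumpBeforeFullGC and -XX:+HeapDumpAfterFullGC,
>> assuming your JVM version supports them.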
>> 
>>> We can run this after our next code release in two days.
>> 
>> 
>>>>> Any help would be greatly appreciated.
>> 
>> Are you populating huge caches on startup or anything like that?
>> Generational GC performance is related to the number of surviving
>> objects that have to be moved from one generation to the next, so if
>> you have a huge population of objects surviving each collection one
>> after the other (which is what I might expect if you had an enormous
>> cache being filled all at once) then your performance will suck for a
>> while until those objects are tenured enough to avoid periodic
>> collections.
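>> 
>> A cheap way to watch that happening, assuming the JDK tools are on
>> the box, is something like
>> 
>>   jstat -gcutil <pid> 5000
>> 
>> which prints eden, survivor, old and perm occupancy (plus GC counts
>> and times) every 5 seconds, so you can see whether the old generation
>> fills steadily during those first 5-6 hours after a reload.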
>> 
>>> This was our initial assumption; however, since we're using an
>>> in-memory EhCache we would expect this issue to recur when Tomcat is
>>> restarted.  It doesn't.  It only shows up if we stop Tomcat, reload
>>> the WAR and clean the work directory.
>
>Which version of EhCache?  How large is the cache, and are you using
>RMI replication or disk overflow?


We're not using replication or disk overflow.

>
>
>p
>
>
>> Are you able to inspect the various generations before and after these
>> pauses? I'd be interested to see if you have a huge population of
>> objects being tenured at once and what types of objects they are (they
>> might be something obvious like instances of our.big.fat.CacheEntry or
>> something like that).
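>> 
>> A class histogram is usually enough for a first pass there, e.g.
>> 
>>   jmap -histo:live <pid> | head -30
>> 
>> run before and after one of the pauses should show which classes
>> dominate (note that :live itself forces a full GC, so time it
>> carefully).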
>> 
>>> We'll give it a try
>> 
>>> Thanks
>> 
>> 
>> 
>> -chris
>>>
>> 
>> 
>> 
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
