ZooKeeper is not on the same nodes as the data nodes...and yes, we could raise the timeout to 120 seconds, but then we are back to saying that a node being AWOL for 118 seconds is OK, which it is not.
Bottom line is the JVM is our enemy here (as it always has been) and we had high hopes for Todd's fix, and it is not panning out for us...yet.

On Mon, May 23, 2011 at 11:07 AM, Michael Segel <michael_se...@hotmail.com> wrote:

>
> Besides this...
> JRE version: 6.0_17-b17
>
> Just a silly question ...
> What happens if you double the zookeeper time out to 120 seconds?
>
> Also I'm going to assume that you're not running your ZK on the same nodes
> as your data nodes, but you know what they say about assumptions...
>
>
> > From: tdunn...@maprtech.com
> > Date: Mon, 23 May 2011 07:33:05 -0700
> > Subject: Re: mslab enabled jvm crash
> > To: user@hbase.apache.org
> >
> > Do you have the same problem with a more recent JVM?
> >
> > On Mon, May 23, 2011 at 4:52 AM, Wayne <wav...@gmail.com> wrote:
> >
> > > I have switched to using the mslab enabled java setting to try to avoid GC
> > > causing nodes to go awol but it almost appears to be worse. Below is the
> > > latest problem with the JVM apparently actually crashing. I am using 0.90.1
> > > with an 8GB heap. Is there a recommended JVM and recommended settings to be
> > > used? As it stands right now we can not run 24 hours under heavy write load
> > > without a node being taken out by zookeeper for GCing > 60 sec or other
> > > problems like below.
> > >
> > > Any help would be greatly appreciated.
> > >
> > >
> > > 2011-05-23T02:34:51.626+0000: 13902.361: [GC 13902.361: [ParNew: 249216K->27648K(249216K), 0.1119520 secs] 7546544K->7433319K(8360960K), 0.1120390 secs] [Times: user=1.14 sys=0.05, real=0.11 secs]
> > > 2011-05-23T02:34:52.292+0000: 13903.027: [GC 13903.027: [ParNew: 249216K->27648K(249216K), 0.0732800 secs] 7654887K->7506032K(8360960K), 0.0733690 secs] [Times: user=0.76 sys=0.02, real=0.08 secs]
> > > 2011-05-23T02:34:52.721+0000: 13903.456: [CMS-concurrent-mark: 8.137/10.065 secs] [Times: user=60.86 sys=2.98, real=10.06 secs]
> > > 2011-05-23T02:34:52.721+0000: 13903.456: [CMS-concurrent-preclean-start]
> > > 2011-05-23T02:34:52.839+0000: 13903.574: [GC 13903.574: [ParNew: 249216K->27648K(249216K), 0.0575510 secs] 7727600K->7562758K(8360960K), 0.0576420 secs] [Times: user=0.62 sys=0.02, real=0.06 secs]
> > > 2011-05-23T02:34:53.190+0000: 13903.925: [GC 13903.925: [ParNew: 249171K->27648K(249216K), 0.1108480 secs] 7784281K->7661505K(8360960K), 0.1109440 secs] [Times: user=1.10 sys=0.03, real=0.11 secs]
> > > 2011-05-23T02:34:53.539+0000: 13904.274: [GC 13904.274: [ParNew (promotion failed): 249216K->249216K(249216K), 0.1207770 secs]13904.395: [CMS2011-05-23T02:34:54.310+0000: 13905.045: [CMS-concurrent-preclean: 1.245/1.589 secs] [Times: user=5.99 sys=0.13, real=1.59 secs]
> > > (concurrent mode failure)#
> > > # A fatal error has been detected by the Java Runtime Environment:
> > > #
> > > # SIGSEGV (0xb) at pc=0x00002b19debbe665, pid=25868, tid=1078290752
> > > #
> > > # JRE version: 6.0_17-b17
> > > # Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 )
> > > # Derivative: IcedTea6 1.7.10
> > > # Distribution: Custom build (Wed May 4 23:17:24 EDT 2011)
> > > # Problematic frame:
> > > # V [libjvm.so+0x29d665]
> > > #
> > > # An error report file with more information is saved as:
> > > # .../hbase-0.90.1/hs_err_pid25868.log
> > > #
> > > # If you would like to submit a bug report, please include
> > > # instructions how to reproduce the bug and visit:
> > > # http://icedtea.classpath.org/bugzilla
> > >
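
For anyone reading the quoted GC log, here is a rough sketch of how the four successful ParNew lines could be parsed to see how full the heap is after each young collection. It is only an illustration: the names `GcSample` and `parse_parnew` are made up for this example, the regular expression assumes the exact log layout quoted above, and the sample text is just those four lines joined back onto single lines.

```python
import re
from dataclasses import dataclass

# The four successful ParNew entries from the quoted log, unwrapped.
LOG = """\
2011-05-23T02:34:51.626+0000: 13902.361: [GC 13902.361: [ParNew: 249216K->27648K(249216K), 0.1119520 secs] 7546544K->7433319K(8360960K), 0.1120390 secs]
2011-05-23T02:34:52.292+0000: 13903.027: [GC 13903.027: [ParNew: 249216K->27648K(249216K), 0.0732800 secs] 7654887K->7506032K(8360960K), 0.0733690 secs]
2011-05-23T02:34:52.839+0000: 13903.574: [GC 13903.574: [ParNew: 249216K->27648K(249216K), 0.0575510 secs] 7727600K->7562758K(8360960K), 0.0576420 secs]
2011-05-23T02:34:53.190+0000: 13903.925: [GC 13903.925: [ParNew: 249171K->27648K(249216K), 0.1108480 secs] 7784281K->7661505K(8360960K), 0.1109440 secs]
"""

# Assumed layout: "[ParNew: youngBefore->youngAfter(youngCap), t secs] heapBefore->heapAfter(heapCap)"
PARNEW = re.compile(
    r"\[ParNew: (\d+)K->(\d+)K\((\d+)K\), [\d.]+ secs\] "
    r"(\d+)K->(\d+)K\((\d+)K\)"
)


@dataclass
class GcSample:
    """Hypothetical record for one young collection (sizes in KB)."""
    heap_before_kb: int
    heap_after_kb: int
    heap_capacity_kb: int

    @property
    def occupancy(self) -> float:
        """Fraction of the reported heap capacity still in use after the GC."""
        return self.heap_after_kb / self.heap_capacity_kb


def parse_parnew(text: str) -> list[GcSample]:
    """Return one GcSample per ParNew line that matches the assumed layout."""
    samples = []
    for match in PARNEW.finditer(text):
        _, _, _, before, after, capacity = map(int, match.groups())
        samples.append(GcSample(before, after, capacity))
    return samples


if __name__ == "__main__":
    for sample in parse_parnew(LOG):
        print(f"{sample.heap_after_kb}K in use after GC, {sample.occupancy:.1%} of heap")
```

On these four lines the post-collection heap usage rises with every young GC while the CMS cycle is still running, which is consistent with the promotion failure and concurrent mode failure that follow in the log.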