Bootstrap failure

Keith Wright Wed, 05 Feb 2014 06:12:48 -0800

Hi all,

    We have been struggling to bootstrap nodes into our 1.2.13 environment 
with vnodes, running CentOS 6.4 with Java 7.  We have an 8-node cluster (32 GB 
RAM, dual hex-core, SSDs, 8 GB heap with 1200 MB eden space, RF 3) with around 
1 TB per node using Murmur3.  When we bootstrap a new node, this is what we 
see:


 *   The bootstrapping node assigns tokens and requests data from the cluster
 *   5-6 nodes within the cluster begin to stream data
 *   Around 2 minutes after bootstrap start, between 1 and 4 nodes (sometimes 
the bootstrapping node and sometimes not) become unresponsive in ParNew GCs
 *   If no nodes go down during the first 5 minutes of bootstrap, then the 
bootstrap will succeed without issue
 *   GC-mired nodes tend to recover after a minute or two, but the receiving 
node stops attempting to get more data from them
 *   Bootstrap eventually fails (after streaming all the data from nodes that 
did not go down) with Unable to Fetch Ranges

We have tried the following and it appears that sometimes a bootstrap will 
succeed (perhaps 1 in 10) but with no discernible pattern:

 *   Increase phi_convict_threshold to 16
 *   Restart all nodes prior to bootstrap (to ensure heap is as “clean” as 
possible)
 *   Stop production load against the cluster (to reduce ParNew churn); after 
5 minutes we know whether the bootstrap will succeed, so we then re-enable load
 *   Distribute soft interrupts across all CPUs

Below is an output from the GC log of the bootstrapping node when it was stuck 
in GC.

Has anyone seen this before?  This is our production cluster and our inability 
to grow is a BLOCKING issue for us.  Any ideas would be VERY helpful.

Thanks


{Heap before GC invocations=109 (full 0):
 par new generation   total 1105920K, used 1021140K [0x00000005fae00000, 0x0000000645e00000, 0x0000000645e00000)
  eden space 983040K, 100% used [0x00000005fae00000, 0x0000000636e00000, 0x0000000636e00000)
  from space 122880K,  31% used [0x000000063e600000, 0x0000000640b350f0, 0x0000000645e00000)
  to   space 122880K,   0% used [0x0000000636e00000, 0x0000000636e00000, 0x000000063e600000)
 concurrent mark-sweep generation total 7159808K, used 3826815K [0x0000000645e00000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 24512K, used 24368K [0x00000007fae00000, 0x00000007fc5f0000, 0x0000000800000000)
2014-02-05T13:27:49.621+0000: 210.242: [GC 210.242: [ParNew: 1021140K->122880K(1105920K), 0.2963210 secs] 4847955K->4024095K(8265728K), 0.2965270 secs] [Times: user=4.97 sys=0.00, real=0.30 secs]
Heap after GC invocations=110 (full 0):
 par new generation   total 1105920K, used 122880K [0x00000005fae00000, 0x0000000645e00000, 0x0000000645e00000)
  eden space 983040K,   0% used [0x00000005fae00000, 0x00000005fae00000, 0x0000000636e00000)
  from space 122880K, 100% used [0x0000000636e00000, 0x000000063e600000, 0x000000063e600000)
  to   space 122880K,   0% used [0x000000063e600000, 0x000000063e600000, 0x0000000645e00000)
 concurrent mark-sweep generation total 7159808K, used 3901215K [0x0000000645e00000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 24512K, used 24368K [0x00000007fae00000, 0x00000007fc5f0000, 0x0000000800000000)
}
Total time for which application threads were stopped: 0.2968550 seconds
Application time: 1.5953840 seconds
Total time for which application threads were stopped: 0.0002040 seconds
Application time: 0.0000510 seconds
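
For anyone who wants to pull numbers out of a log like this, here is a minimal Python sketch that parses the ParNew summary line and works out how much was promoted to the old generation in that collection. The function name promoted_kb and the regular expression are illustrative assumptions written against the line format shown in this log, not part of any Cassandra or JVM tooling, and other JVM versions or flags may print a different layout.

```python
import re

# Assumed layout of a ParNew summary line, as it appears in the log above:
# [ParNew: youngBefore->youngAfter(youngCap), t secs] heapBefore->heapAfter(heapCap), t secs]
# \s+ is used between tokens so that mail-wrapped lines still match.
PARNEW_RE = re.compile(
    r"\[ParNew:\s+(\d+)K->(\d+)K\((\d+)K\),\s+[\d.]+\s+secs\]\s+"
    r"(\d+)K->(\d+)K\((\d+)K\),\s+([\d.]+)\s+secs\]"
)


def promoted_kb(line):
    """Return old-gen occupancy before/after and the promoted amount (in K).

    Returns None if the line does not look like a ParNew summary line.
    """
    m = PARNEW_RE.search(line)
    if m is None:
        return None
    young_before, young_after, _young_cap, heap_before, heap_after, _heap_cap = (
        int(g) for g in m.groups()[:6]
    )
    # Old-gen occupancy is whole-heap occupancy minus young-gen occupancy.
    old_before = heap_before - young_before
    old_after = heap_after - young_after
    return {
        "old_before_kb": old_before,
        "old_after_kb": old_after,
        "promoted_kb": old_after - old_before,
        "pause_secs": float(m.group(7)),
    }


LINE = (
    "2014-02-05T13:27:49.621+0000: 210.242: [GC 210.242: [ParNew: "
    "1021140K->122880K(1105920K), 0.2963210 secs] 4847955K->4024095K(8265728K), "
    "0.2965270 secs] [Times: user=4.97 sys=0.00, real=0.30 secs]"
)

stats = promoted_kb(LINE)
print(stats)
```

For the line in this log the sketch reports 74400K promoted, which agrees with the concurrent mark-sweep generation growing from 3826815K to 3901215K in the heap dumps before and after the collection.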
