Still having issues with node bootstrapping. The new node just died: it went into a full GC, and the nodes it had active streams with noticed it was down. After the full GC finished, the new node printed this log:
ERROR 02:52:36,259 Stream failed because /10.10.20.35 died or was restarted/removed (streams may still be active in background, but further streams won't be started)

Here 10.10.20.35 is an existing node that the new node was streaming from. A similar log line was printed for every other node in the cluster. Why did the new node just exit after the full GC pause?

We have heap dumps enabled on full GCs, and these are the top offenders on the new node. A new entry that I noticed is the CompressionMetadata chunks. Is there anything I can do to optimize that?

 num     #instances         #bytes  class name
----------------------------------------------
   1:      42508421     4818885752  [B
   2:      65860543     3161306064  java.nio.HeapByteBuffer
   3:     124361093     2984666232  org.apache.cassandra.io.compress.CompressionMetadata$Chunk
   4:      29745665     1427791920  edu.stanford.ppl.concurrent.SnapTreeMap$Node
   5:      29810362      953931584  org.apache.cassandra.db.Column
   6:         31623      498012768  [Lorg.apache.cassandra.io.compress.CompressionMetadata$Chunk;
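As a rough back-of-the-envelope check (a sketch only: the 4 TB data size, the default chunk_length_kb of 64, and ~24 bytes of heap per Chunk entry are assumptions for illustration, not measurements from this node), the chunk count scales with data size divided by chunk length:

    #!/bin/bash
    # Sketch: estimate how many CompressionMetadata$Chunk entries a node holds
    # and the heap they need. Assumes ~4 TB of SSTable data, the default
    # chunk_length_kb of 64, and ~24 bytes of heap per Chunk entry (roughly
    # what the histogram above implies: 2984666232 / 124361093 ~= 24).
    data_bytes=$((4 * 1024 * 1024 * 1024 * 1024))   # ~4 TB of SSTables
    chunk_bytes=$((64 * 1024))                      # chunk_length_kb = 64
    bytes_per_chunk=24

    chunks=$((data_bytes / chunk_bytes))
    heap_mb=$((chunks * bytes_per_chunk / 1024 / 1024))
    echo "expected chunks: ${chunks}, ~${heap_mb} MB of heap for chunk metadata"

    # Raising chunk_length_kb on the largest column families cuts the chunk
    # count proportionally (at the cost of larger reads per chunk). The table
    # name below is illustrative only, and the new setting only applies to
    # SSTables written or compacted after the change:
    #   echo "ALTER TABLE my_keyspace.my_table
    #           WITH compression = {'sstable_compression': 'SnappyCompressor',
    #                               'chunk_length_kb': 256};" | cqlsh

Doubling the chunk length roughly halves the number of Chunk objects, which is the knob most directly tied to line 3 of the histogram above.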
On Tue, Aug 5, 2014 at 2:59 PM, Ruchir Jha <ruchir....@gmail.com> wrote:

> Also, right now the "top" command shows that we are at 500-700% CPU, and we
> have 23 total processors, which means we have a lot of idle CPU left over,
> so throwing more threads at compaction and flush should alleviate the
> problem?
>
> On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>
>> Right now, we have 6 flush writers and compaction_throughput_mb_per_sec
>> is set to 0, which I believe disables throttling.
>>
>> Also, here is the iostat -x 5 5 output:
>>
>> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
>> sda      10.00   1450.35   50.79   55.92     9775.97   12030.14   204.34    1.56      14.62   1.05   11.21
>> dm-0     0.00    0.00      3.59    18.82     166.52    150.35     14.14     0.44      19.49   0.54   1.22
>> dm-1     0.00    0.00      2.32    5.37      18.56     42.98      8.00      0.76      98.82   0.43   0.33
>> dm-2     0.00    0.00      162.17  5836.66   32714.46  47040.87   13.30     5.57      0.90    0.06   36.00
>> sdb      0.40    4251.90   106.72  107.35    23123.61  35204.09   272.46    4.43      20.68   1.29   27.64
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           14.64  10.75  1.81     13.50    0.00    59.29
>>
>> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
>> sda      15.40   1344.60   68.80   145.60    4964.80   11790.40   78.15     0.38      1.80    0.80   17.10
>> dm-0     0.00    0.00      43.00   1186.20   2292.80   9489.60    9.59      4.88      3.90    0.09   11.58
>> dm-1     0.00    0.00      1.60    0.00      12.80     0.00       8.00      0.03      16.00   2.00   0.32
>> dm-2     0.00    0.00      197.20  17583.80  35152.00  140664.00  9.89      2847.50   109.52  0.05   93.50
>> sdb      13.20   16552.20  159.00  742.20    32745.60  129129.60  179.62    72.88     66.01   1.04   93.42
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           15.51  19.77  1.97     5.02     0.00    57.73
>>
>> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
>> sda      16.20   523.40    60.00   285.00    5220.80   5913.60    32.27     0.25      0.72    0.60   20.86
>> dm-0     0.00    0.00      0.80    1.40      32.00     11.20      19.64     0.01      3.18    1.55   0.34
>> dm-1     0.00    0.00      1.60    0.00      12.80     0.00       8.00      0.03      21.00   2.62   0.42
>> dm-2     0.00    0.00      339.40  5886.80   66219.20  47092.80   18.20     251.66    184.72  0.10   63.48
>> sdb      1.00    5025.40   264.20  209.20    60992.00  50422.40   235.35    5.98      40.92   1.23   58.28
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           16.59  16.34  2.03     9.01     0.00    56.04
>>
>> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
>> sda      5.40    320.00    37.40   159.80    2483.20   3529.60    30.49     0.10      0.52    0.39   7.76
>> dm-0     0.00    0.00      0.20    3.60      1.60      28.80      8.00      0.00      0.68    0.68   0.26
>> dm-1     0.00    0.00      0.00    0.00      0.00      0.00       0.00      0.00      0.00    0.00   0.00
>> dm-2     0.00    0.00      287.20  13108.20  53985.60  104864.00  11.86     869.18    48.82   0.06   76.96
>> sdb      5.20    12163.40  238.20  532.00    51235.20  93753.60   188.25    21.46     23.75   0.97   75.08
>>
>> On Tue, Aug 5, 2014 at 1:55 PM, Mark Reddy <mark.re...@boxever.com> wrote:
>>
>>> Hi Ruchir,
>>>
>>> The large number of blocked flushes and the number of pending compactions
>>> would still indicate IO contention. Can you post the output of
>>> 'iostat -x 5 5'?
>>>
>>> If you do in fact have spare IO, there are several configuration options
>>> you can tune, such as increasing the number of flush writers and
>>> compaction_throughput_mb_per_sec.
>>>
>>> Mark
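For reference, the two knobs Mark mentions are plain cassandra.yaml settings, and compaction throughput can also be changed on a running node without a restart. A minimal sketch; the 16 MB/s figure is only an illustrative starting point, not a recommendation from this thread:

    # cassandra.yaml settings under discussion (values shown are illustrative):
    #   memtable_flush_writers: 6             # number of memtable flush threads
    #   compaction_throughput_mb_per_sec: 0   # 0 disables compaction throttling
    #
    # Compaction throughput can also be adjusted live, e.g. to re-enable
    # throttling at 16 MB/s while watching iostat:
    nodetool setcompactionthroughput 16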
>>> On Tue, Aug 5, 2014 at 5:22 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>
>>>> Also, Mark, to your comment on my tpstats output: below is my iostat
>>>> output, and the iowait is at 4.59%, which means no IO pressure, but we
>>>> are still seeing the bad flush performance. Should we try increasing the
>>>> flush writers?
>>>>
>>>> Linux 2.6.32-358.el6.x86_64 (ny4lpcas13.fusionts.corp)  08/05/2014  _x86_64_  (24 CPU)
>>>>
>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>           5.80   10.25  0.65     4.59     0.00    78.72
>>>>
>>>> Device:  tps      Blk_read/s  Blk_wrtn/s  Blk_read     Blk_wrtn
>>>> sda      103.83   9630.62     11982.60    3231174328   4020290310
>>>> dm-0     13.57    160.17      81.12       53739546     27217432
>>>> dm-1     7.59     16.94       43.77       5682200      14686784
>>>> dm-2     5792.76  32242.66    45427.12    10817753530  15241278360
>>>> sdb      206.09   22789.19    33569.27    7646015080   11262843224
>>>>
>>>> On Tue, Aug 5, 2014 at 12:13 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>
>>>>> nodetool status:
>>>>>
>>>>> Datacenter: datacenter1
>>>>> =======================
>>>>> Status=Up/Down
>>>>> |/ State=Normal/Leaving/Joining/Moving
>>>>> --  Address      Load     Tokens  Owns (effective)  Host ID                               Rack
>>>>> UN  10.10.20.27  1.89 TB  256     25.4%             76023cdd-c42d-4068-8b53-ae94584b8b04  rack1
>>>>> UN  10.10.20.62  1.83 TB  256     25.5%             84b47313-da75-4519-94f3-3951d554a3e5  rack1
>>>>> UN  10.10.20.47  1.87 TB  256     24.7%             bcd51a92-3150-41ae-9c51-104ea154f6fa  rack1
>>>>> UN  10.10.20.45  1.7 TB   256     22.6%             8d6bce33-8179-4660-8443-2cf822074ca4  rack1
>>>>> UN  10.10.20.15  1.86 TB  256     24.5%             01a01f07-4df2-4c87-98e9-8dd38b3e4aee  rack1
>>>>> UN  10.10.20.31  1.87 TB  256     24.9%             1435acf9-c64d-4bcd-b6a4-abcec209815e  rack1
>>>>> UN  10.10.20.35  1.86 TB  256     25.8%             17cb8772-2444-46ff-8525-33746514727d  rack1
>>>>> UN  10.10.20.51  1.89 TB  256     25.0%             0343cd58-3686-465f-8280-56fb72d161e2  rack1
>>>>> UN  10.10.20.19  1.91 TB  256     25.5%             30ddf003-4d59-4a3e-85fa-e94e4adba1cb  rack1
>>>>> UN  10.10.20.39  1.93 TB  256     26.0%             b7d44c26-4d75-4d36-a779-b7e7bdaecbc9  rack1
>>>>> UN  10.10.20.52  1.81 TB  256     25.4%             6b5aca07-1b14-4bc2-a7ba-96f026fa0e4e  rack1
>>>>> UN  10.10.20.22  1.89 TB  256     24.8%             46af9664-8975-4c91-847f-3f7b8f8d5ce2  rack1
>>>>>
>>>>> Note: The new node is not part of the above list.
>>>>>
>>>>> nodetool compactionstats:
>>>>>
>>>>> pending tasks: 1649
>>>>> compaction type  keyspace        column family          completed   total         unit   progress
>>>>> Compaction       iprod           customerorder          1682804084  17956558077   bytes  9.37%
>>>>> Compaction       prodgate        customerorder          1664239271  1693502275    bytes  98.27%
>>>>> Compaction       qa_config_bkup  fixsessionconfig_hist  2443        27253         bytes  8.96%
>>>>> Compaction       prodgate        customerorder_hist     1770577280  5026699390    bytes  35.22%
>>>>> Compaction       iprodgate       customerorder_hist     2959560205  312350192622  bytes  0.95%
>>>>>
>>>>> On Tue, Aug 5, 2014 at 11:37 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>>>
>>>>>>> Yes num_tokens is set to 256. initial_token is blank on all nodes
>>>>>>> including the new one.
>>>>>>
>>>>>> Ok, so you have num_tokens set to 256 for all nodes with initial_token
>>>>>> commented out; this means you are using vnodes and the new node will
>>>>>> automatically grab a list of tokens to take over responsibility for.
>>>>>>
>>>>>>> Pool Name     Active  Pending  Completed  Blocked  All time blocked
>>>>>>> FlushWriter   0       0        1136       0        512
>>>>>>>
>>>>>>> Looks like about 50% of flushes are blocked.
>>>>>>
>>>>>> This is a problem, as it indicates that the IO system cannot keep up.
>>>>>>
>>>>>>> Just ran this on the new node:
>>>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>>>> 10
>>>>>>
>>>>>> This is normal, as the new node will most likely take tokens from all
>>>>>> nodes in the cluster.
>>>>>>
>>>>>>> Sorry for the multiple updates, but another thing I found was all the
>>>>>>> other existing nodes have themselves in the seeds list, but the new
>>>>>>> node does not have itself in the seeds list. Can that cause this issue?
>>>>>>
>>>>>> Seeds are only used when a new node is bootstrapping into the cluster
>>>>>> and needs a set of IPs to contact and discover the cluster, so this
>>>>>> would have no impact on data sizes or streaming. In general it would be
>>>>>> considered best practice to have a set of 2-3 seeds from each data
>>>>>> center, with all nodes having the same seed list.
>>>>>>
>>>>>> What is the current output of 'nodetool compactionstats'? Could you also
>>>>>> paste the output of 'nodetool status <keyspace>'?
>>>>>>
>>>>>> Mark
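To put numbers behind the "IO system cannot keep up" diagnosis above, it can help to watch the FlushWriter pool over time rather than as a one-off snapshot. A small sketch (mine, not from the thread); the awk field positions assume the tpstats column layout quoted above:

    # Poll tpstats every minute and log the FlushWriter pool counters.
    # Fields assumed (per the output quoted above):
    #   $1=Pool  $2=Active  $3=Pending  $4=Completed  $5=Blocked  $6=All-time blocked
    while true; do
        ts=$(date +%H:%M:%S)
        nodetool tpstats | awk -v ts="$ts" '$1 == "FlushWriter" {
            printf "%s active=%s pending=%s completed=%s blocked=%s all_time_blocked=%s\n",
                   ts, $2, $3, $4, $5, $6
        }'
        sleep 60
    done

If the all-time blocked counter keeps climbing between samples while iostat shows headroom, more flush writers may help; if it climbs while the data disks sit above 90% util, the disks are the bottleneck.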
>>>>>> On Tue, Aug 5, 2014 at 3:59 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry for the multiple updates, but another thing I found was all the
>>>>>>> other existing nodes have themselves in the seeds list, but the new
>>>>>>> node does not have itself in the seeds list. Can that cause this issue?
>>>>>>>
>>>>>>> On Tue, Aug 5, 2014 at 10:30 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just ran this on the new node:
>>>>>>>>
>>>>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>>>>> 10
>>>>>>>>
>>>>>>>> Seems like the new node is receiving data from 10 other nodes. Is
>>>>>>>> that expected in a vnodes-enabled environment?
>>>>>>>>
>>>>>>>> Ruchir.
>>>>>>>>
>>>>>>>> On Tue, Aug 5, 2014 at 10:21 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Also, not sure if this is relevant, but I just noticed the nodetool
>>>>>>>>> tpstats output:
>>>>>>>>>
>>>>>>>>> Pool Name     Active  Pending  Completed  Blocked  All time blocked
>>>>>>>>> FlushWriter   0       0        1136       0        512
>>>>>>>>>
>>>>>>>>> Looks like about 50% of flushes are blocked.
>>>>>>>>>
>>>>>>>>> On Tue, Aug 5, 2014 at 10:14 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, num_tokens is set to 256. initial_token is blank on all nodes,
>>>>>>>>>> including the new one.
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 5, 2014 at 10:03 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>> My understanding was that if initial_token is left empty on the
>>>>>>>>>>>> new node, it just contacts the heaviest node and bisects its
>>>>>>>>>>>> token range.
>>>>>>>>>>>
>>>>>>>>>>> If you are using vnodes and you have num_tokens set to 256, the
>>>>>>>>>>> new node will take token ranges dynamically. What is the
>>>>>>>>>>> configuration of your other nodes; are you setting num_tokens or
>>>>>>>>>>> initial_token on those?
>>>>>>>>>>>
>>>>>>>>>>> Mark
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Patricia for your response!
>>>>>>>>>>>>
>>>>>>>>>>>> On the new node, I just see a lot of the following:
>>>>>>>>>>>>
>>>>>>>>>>>> INFO [FlushWriter:75] 2014-08-05 09:53:04,394 Memtable.java (line 400) Writing Memtable
>>>>>>>>>>>> INFO [CompactionExecutor:3] 2014-08-05 09:53:11,132 CompactionTask.java (line 262) Compacted 12 sstables to
>>>>>>>>>>>>
>>>>>>>>>>>> So basically it is just busy flushing and compacting. Would you
>>>>>>>>>>>> have any ideas on why the 2x disk space blow-up? My understanding
>>>>>>>>>>>> was that if initial_token is left empty on the new node, it just
>>>>>>>>>>>> contacts the heaviest node and bisects its token range. And the
>>>>>>>>>>>> heaviest node is around 2.1 TB, and the new node is already at
>>>>>>>>>>>> 4 TB. Could this be because compaction is falling behind?
>>>>>>>>>>>>
>>>>>>>>>>>> Ruchir
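On the 2x question just above: one quick way to see whether the extra space is simply data that compaction has not caught up with yet is to compare SSTable counts and space usage per column family. A sketch; the grep patterns assume the cfstats wording used by 1.2/2.0-era nodetool:

    # Show per-CF SSTable counts and space usage; a very high SSTable count on
    # the big column families during bootstrap usually means compaction is
    # lagging behind the incoming streams.
    nodetool cfstats | grep -E 'Keyspace:|Column Family:|SSTable count:|Space used'
    # Pending compaction backlog for the node as a whole:
    nodetool compactionstats | head -1

During a vnode bootstrap the node receives ranges from many peers at once, so data lands in many small SSTables, and the on-disk size can stay well above the eventual post-compaction size until that backlog drains.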
>>>>>>>>>>>> On Mon, Aug 4, 2014 at 7:23 PM, Patricia Gorla <patri...@thelastpickle.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ruchir,
>>>>>>>>>>>>>
>>>>>>>>>>>>> What exactly are you seeing in the logs? Are you running major
>>>>>>>>>>>>> compactions on the new bootstrapping node?
>>>>>>>>>>>>>
>>>>>>>>>>>>> With respect to the seed list, it is generally advisable to use
>>>>>>>>>>>>> 3 seed nodes per AZ / DC.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 4, 2014 at 11:41 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am trying to bootstrap the thirteenth node in a 12-node
>>>>>>>>>>>>>> cluster where the average data size per node is about 2.1 TB.
>>>>>>>>>>>>>> The bootstrap streaming has been going on for 2 days now, and
>>>>>>>>>>>>>> the disk size on the new node is already above 4 TB and still
>>>>>>>>>>>>>> growing. Is this because the new node is running major
>>>>>>>>>>>>>> compactions while the streaming is going on?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing that I noticed that seemed off was that the seeds
>>>>>>>>>>>>>> property in the yaml of the 13th node comprises nodes 1..12,
>>>>>>>>>>>>>> whereas the seeds property on the existing 12 nodes consists of
>>>>>>>>>>>>>> all the other nodes except the thirteenth node. Is this an
>>>>>>>>>>>>>> issue?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any other insight is appreciated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ruchir.
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Patricia Gorla
>>>>>>>>>>>>> @patriciagorla
>>>>>>>>>>>>>
>>>>>>>>>>>>> Consultant
>>>>>>>>>>>>> Apache Cassandra Consulting
>>>>>>>>>>>>> http://www.thelastpickle.com
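On the seeds question that started the thread, the setting in question is the seed_provider block of cassandra.yaml. A sketch only; the seed IPs are picked arbitrarily from the nodetool status output above for illustration, and the config path and host list are assumptions:

    # cassandra.yaml excerpt (illustrative seed choices); per the advice above,
    # keep the same short seed list (2-3 nodes per DC) on every node, the new
    # node included:
    #   seed_provider:
    #       - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    #         parameters:
    #             - seeds: "10.10.20.27,10.10.20.35,10.10.20.51"
    #
    # Quick consistency check across the cluster (host list file and yaml path
    # are assumptions for this sketch):
    for h in $(cat cassandra_hosts.txt); do
        echo "== $h"
        ssh "$h" "grep -A4 'seed_provider' /etc/cassandra/conf/cassandra.yaml"
    done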