Hi Ruchir,

The large number of blocked flushes and the number of pending compactions would still indicate IO contention. Can you post the output of 'iostat -x 5 5'? The plain iostat output you posted only shows averages since boot; the extended, interval-based form gives a much better picture of current device load.

If you do in fact have spare IO, there are several configuration options you can tune, such as increasing the number of flush writers (memtable_flush_writers in cassandra.yaml) and compaction_throughput_mb_per_sec.
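For example (a sketch, not a recommendation; the yaml path below is the usual package location, adjust for your install, and the values are illustrative starting points for your own testing):

    # Check the current values; both settings live in cassandra.yaml:
    grep -E 'memtable_flush_writers|compaction_throughput_mb_per_sec' \
        /etc/cassandra/conf/cassandra.yaml

    # Compaction throughput can also be raised at runtime without a restart,
    # e.g. from the default of 16 MB/s to 32 MB/s:
    nodetool setcompactionthroughput 32

Bear in mind that memtable_flush_writers is only read at startup, so changing it means a yaml edit plus a rolling restart.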
Mark

On Tue, Aug 5, 2014 at 5:22 PM, Ruchir Jha <ruchir....@gmail.com> wrote:

> Also, Mark, to your comment on my tpstats output: below is my iostat output,
> and the iowait is at 4.59%, which means no IO pressure, but we are still
> seeing the bad flush performance. Should we try increasing the flush
> writers?
>
> Linux 2.6.32-358.el6.x86_64 (ny4lpcas13.fusionts.corp)  08/05/2014  _x86_64_  (24 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            5.80   10.25    0.65    4.59    0.00   78.72
>
> Device:       tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
> sda        103.83      9630.62     11982.60   3231174328   4020290310
> dm-0        13.57       160.17        81.12     53739546     27217432
> dm-1         7.59        16.94        43.77      5682200     14686784
> dm-2      5792.76     32242.66     45427.12  10817753530  15241278360
> sdb        206.09     22789.19     33569.27   7646015080  11262843224
>
> On Tue, Aug 5, 2014 at 12:13 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>
>> nodetool status:
>>
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address      Load     Tokens  Owns (effective)  Host ID                               Rack
>> UN  10.10.20.27  1.89 TB  256     25.4%             76023cdd-c42d-4068-8b53-ae94584b8b04  rack1
>> UN  10.10.20.62  1.83 TB  256     25.5%             84b47313-da75-4519-94f3-3951d554a3e5  rack1
>> UN  10.10.20.47  1.87 TB  256     24.7%             bcd51a92-3150-41ae-9c51-104ea154f6fa  rack1
>> UN  10.10.20.45  1.7 TB   256     22.6%             8d6bce33-8179-4660-8443-2cf822074ca4  rack1
>> UN  10.10.20.15  1.86 TB  256     24.5%             01a01f07-4df2-4c87-98e9-8dd38b3e4aee  rack1
>> UN  10.10.20.31  1.87 TB  256     24.9%             1435acf9-c64d-4bcd-b6a4-abcec209815e  rack1
>> UN  10.10.20.35  1.86 TB  256     25.8%             17cb8772-2444-46ff-8525-33746514727d  rack1
>> UN  10.10.20.51  1.89 TB  256     25.0%             0343cd58-3686-465f-8280-56fb72d161e2  rack1
>> UN  10.10.20.19  1.91 TB  256     25.5%             30ddf003-4d59-4a3e-85fa-e94e4adba1cb  rack1
>> UN  10.10.20.39  1.93 TB  256     26.0%             b7d44c26-4d75-4d36-a779-b7e7bdaecbc9  rack1
>> UN  10.10.20.52  1.81 TB  256     25.4%             6b5aca07-1b14-4bc2-a7ba-96f026fa0e4e  rack1
>> UN  10.10.20.22  1.89 TB  256     24.8%             46af9664-8975-4c91-847f-3f7b8f8d5ce2  rack1
>>
>> Note: The new node is not part of the above list.
>>
>> nodetool compactionstats:
>>
>> pending tasks: 1649
>>  compaction type   keyspace        column family           completed    total         unit   progress
>>       Compaction   iprod           customerorder           1682804084   17956558077   bytes      9.37%
>>       Compaction   prod            gatecustomerorder       1664239271   1693502275    bytes     98.27%
>>       Compaction   qa_config_bkup  fixsessionconfig_hist   2443         27253         bytes      8.96%
>>       Compaction   prod            gatecustomerorder_hist  1770577280   5026699390    bytes     35.22%
>>       Compaction   iprod           gatecustomerorder_hist  2959560205   312350192622  bytes      0.95%
>>
>> On Tue, Aug 5, 2014 at 11:37 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>
>>>> Yes num_tokens is set to 256. initial_token is blank on all nodes
>>>> including the new one.
>>>
>>> Ok, so you have num_tokens set to 256 for all nodes with initial_token
>>> commented out; this means you are using vnodes and the new node will
>>> automatically grab a list of tokens to take over responsibility for.
>>>
>>>> Pool Name      Active   Pending   Completed   Blocked   All time blocked
>>>> FlushWriter         0         0        1136         0                512
>>>>
>>>> Looks like about 50% of flushes are blocked.
>>>
>>> This is a problem, as it indicates that the IO system cannot keep up.
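>>> A quick way to keep an eye on this while you investigate (a sketch; the
>>> pool name matches your tpstats output above):
>>>
>>>     # If "All time blocked" keeps climbing between runs, flushes are
>>>     # still backing up:
>>>     nodetool tpstats | grep -E 'Pool Name|FlushWriter'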
>>>
>>>> Just ran this on the new node:
>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>> 10
>>>
>>> This is normal, as the new node will most likely take tokens from all
>>> nodes in the cluster.
>>>
>>>> Sorry for the multiple updates, but another thing I found was all the
>>>> other existing nodes have themselves in the seeds list, but the new node
>>>> does not have itself in the seeds list. Can that cause this issue?
>>>
>>> Seeds are only used when a new node is bootstrapping into the cluster
>>> and needs a set of IPs to contact and discover the cluster, so this would
>>> have no impact on data sizes or streaming. In general it is considered
>>> best practice to have a set of 2-3 seeds from each data center, with all
>>> nodes having the same seed list.
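>>> A quick way to sanity-check that (the yaml path is the usual package
>>> location, adjust for your install; the three IPs are just an illustrative
>>> pick from your nodetool status output):
>>>
>>>     # This line should be identical on every node, e.g.:
>>>     grep 'seeds:' /etc/cassandra/conf/cassandra.yaml
>>>     #     - seeds: "10.10.20.27,10.10.20.62,10.10.20.47"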
>>>
>>> What is the current output of 'nodetool compactionstats'? Could you also
>>> paste the output of 'nodetool status <keyspace>'?
>>>
>>> Mark
>>>
>>> On Tue, Aug 5, 2014 at 3:59 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>
>>>> Sorry for the multiple updates, but another thing I found was all the
>>>> other existing nodes have themselves in the seeds list, but the new node
>>>> does not have itself in the seeds list. Can that cause this issue?
>>>>
>>>> On Tue, Aug 5, 2014 at 10:30 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>
>>>>> Just ran this on the new node:
>>>>>
>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>> 10
>>>>>
>>>>> Seems like the new node is receiving data from 10 other nodes. Is that
>>>>> expected in a vnodes-enabled environment?
>>>>>
>>>>> Ruchir.
>>>>>
>>>>> On Tue, Aug 5, 2014 at 10:21 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>
>>>>>> Also, not sure if this is relevant, but I just noticed this in the
>>>>>> nodetool tpstats output:
>>>>>>
>>>>>> Pool Name      Active   Pending   Completed   Blocked   All time blocked
>>>>>> FlushWriter         0         0        1136         0                512
>>>>>>
>>>>>> Looks like about 50% of flushes are blocked.
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 10:14 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, num_tokens is set to 256. initial_token is blank on all nodes,
>>>>>>> including the new one.
>>>>>>>
>>>>>>> On Tue, Aug 5, 2014 at 10:03 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>>>>>
>>>>>>>>> My understanding was that if initial_token is left empty on the new
>>>>>>>>> node, it just contacts the heaviest node and bisects its token range.
>>>>>>>>
>>>>>>>> If you are using vnodes and you have num_tokens set to 256, the new
>>>>>>>> node will take token ranges dynamically. What is the configuration of
>>>>>>>> your other nodes: are you setting num_tokens or initial_token on those?
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Patricia, for your response!
>>>>>>>>>
>>>>>>>>> On the new node, I just see a lot of the following:
>>>>>>>>>
>>>>>>>>> INFO [FlushWriter:75] 2014-08-05 09:53:04,394 Memtable.java (line
>>>>>>>>> 400) Writing Memtable
>>>>>>>>> INFO [CompactionExecutor:3] 2014-08-05 09:53:11,132
>>>>>>>>> CompactionTask.java (line 262) Compacted 12 sstables to
>>>>>>>>>
>>>>>>>>> So basically it is just busy flushing and compacting. Would you have
>>>>>>>>> any idea why the disk space has blown up 2x? My understanding was
>>>>>>>>> that if initial_token is left empty on the new node, it just contacts
>>>>>>>>> the heaviest node and bisects its token range. The heaviest node is
>>>>>>>>> around 2.1 TB, and the new node is already at 4 TB. Could this be
>>>>>>>>> because compaction is falling behind?
>>>>>>>>>
>>>>>>>>> Ruchir
>>>>>>>>>
>>>>>>>>> On Mon, Aug 4, 2014 at 7:23 PM, Patricia Gorla <patri...@thelastpickle.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ruchir,
>>>>>>>>>>
>>>>>>>>>> What exactly are you seeing in the logs? Are you running major
>>>>>>>>>> compactions on the new bootstrapping node?
>>>>>>>>>>
>>>>>>>>>> With respect to the seed list, it is generally advisable to use 3
>>>>>>>>>> seed nodes per AZ / DC.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 4, 2014 at 11:41 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am trying to bootstrap the thirteenth node into a 12-node cluster
>>>>>>>>>>> where the average data size per node is about 2.1 TB. The bootstrap
>>>>>>>>>>> streaming has been going on for 2 days now, and the disk usage on
>>>>>>>>>>> the new node is already above 4 TB and still growing. Is this
>>>>>>>>>>> because the new node is running major compactions while the
>>>>>>>>>>> streaming is going on?
>>>>>>>>>>>
>>>>>>>>>>> One thing I noticed that seemed off: the seeds property in the yaml
>>>>>>>>>>> of the 13th node comprises nodes 1..12, whereas the seeds property
>>>>>>>>>>> on the existing 12 nodes consists of all the other nodes except the
>>>>>>>>>>> thirteenth. Is this an issue?
>>>>>>>>>>>
>>>>>>>>>>> Any other insight is appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Ruchir.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Patricia Gorla
>>>>>>>>>> @patriciagorla
>>>>>>>>>>
>>>>>>>>>> Consultant
>>>>>>>>>> Apache Cassandra Consulting
>>>>>>>>>> http://www.thelastpickle.com