Also, right now the "top" command shows that we are at 500-700% CPU, and we have 24 processors in total (per the iostat header quoted below), which means we have a lot of idle CPU left over. So throwing more threads at compaction and flush should alleviate the problem, right?
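A quick way to sanity-check that there really is per-core headroom before raising thread counts (a minimal sketch; mpstat ships in the same sysstat package as iostat):

    nproc              # logical CPU count; the iostat header quoted below reports 24
    mpstat -P ALL 5 1  # per-core utilization over one 5-second window
    # If a few cores sit near 100% while the rest idle, the node is thread-limited
    # rather than CPU-limited, and extra flush/compaction threads may indeed help.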
On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha <ruchir....@gmail.com> wrote:

> Right now we have 6 flush writers, and compaction_throughput_mb_per_sec is set to 0, which I believe disables throttling.
>
> Also, here is the iostat -x 5 5 output:
>
> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
> sda      10.00   1450.35   50.79   55.92     9775.97   12030.14   204.34    1.56      14.62   1.05   11.21
> dm-0     0.00    0.00      3.59    18.82     166.52    150.35     14.14     0.44      19.49   0.54   1.22
> dm-1     0.00    0.00      2.32    5.37      18.56     42.98      8.00      0.76      98.82   0.43   0.33
> dm-2     0.00    0.00      162.17  5836.66   32714.46  47040.87   13.30     5.57      0.90    0.06   36.00
> sdb      0.40    4251.90   106.72  107.35    23123.61  35204.09   272.46    4.43      20.68   1.29   27.64
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           14.64  10.75  1.81     13.50    0.00    59.29
>
> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
> sda      15.40   1344.60   68.80   145.60    4964.80   11790.40   78.15     0.38      1.80    0.80   17.10
> dm-0     0.00    0.00      43.00   1186.20   2292.80   9489.60    9.59      4.88      3.90    0.09   11.58
> dm-1     0.00    0.00      1.60    0.00      12.80     0.00       8.00      0.03      16.00   2.00   0.32
> dm-2     0.00    0.00      197.20  17583.80  35152.00  140664.00  9.89      2847.50   109.52  0.05   93.50
> sdb      13.20   16552.20  159.00  742.20    32745.60  129129.60  179.62    72.88     66.01   1.04   93.42
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           15.51  19.77  1.97     5.02     0.00    57.73
>
> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
> sda      16.20   523.40    60.00   285.00    5220.80   5913.60    32.27     0.25      0.72    0.60   20.86
> dm-0     0.00    0.00      0.80    1.40      32.00     11.20      19.64     0.01      3.18    1.55   0.34
> dm-1     0.00    0.00      1.60    0.00      12.80     0.00       8.00      0.03      21.00   2.62   0.42
> dm-2     0.00    0.00      339.40  5886.80   66219.20  47092.80   18.20     251.66    184.72  0.10   63.48
> sdb      1.00    5025.40   264.20  209.20    60992.00  50422.40   235.35    5.98      40.92   1.23   58.28
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           16.59  16.34  2.03     9.01     0.00    56.04
>
> Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
> sda      5.40    320.00    37.40   159.80    2483.20   3529.60    30.49     0.10      0.52    0.39   7.76
> dm-0     0.00    0.00      0.20    3.60      1.60      28.80      8.00      0.00      0.68    0.68   0.26
> dm-1     0.00    0.00      0.00    0.00      0.00      0.00       0.00      0.00      0.00    0.00   0.00
> dm-2     0.00    0.00      287.20  13108.20  53985.60  104864.00  11.86     869.18    48.82   0.06   76.96
> sdb      5.20    12163.40  238.20  532.00    51235.20  93753.60   188.25    21.46     23.75   0.97   75.08
>
> On Tue, Aug 5, 2014 at 1:55 PM, Mark Reddy <mark.re...@boxever.com> wrote:
>
>> Hi Ruchir,
>>
>> The large number of blocked flushes and the number of pending compactions would still indicate IO contention. Can you post the output of 'iostat -x 5 5'?
>>
>> If you do in fact have spare IO, there are several configuration options you can tune, such as increasing the number of flush writers and compaction_throughput_mb_per_sec.
>>
>> Mark
>>
>> On Tue, Aug 5, 2014 at 5:22 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>
>>> Also, Mark, to your comment on my tpstats output: below is my iostat output, and the iowait is at 4.59%, which means no IO pressure, but we are still seeing the bad flush performance. Should we try increasing the flush writers?
>>>
>>> Linux 2.6.32-358.el6.x86_64 (ny4lpcas13.fusionts.corp)   08/05/2014   _x86_64_   (24 CPU)
>>>
>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>           5.80   10.25  0.65     4.59     0.00    78.72
>>>
>>> Device:  tps      Blk_read/s  Blk_wrtn/s  Blk_read     Blk_wrtn
>>> sda      103.83   9630.62     11982.60    3231174328   4020290310
>>> dm-0     13.57    160.17      81.12       53739546     27217432
>>> dm-1     7.59     16.94       43.77       5682200      14686784
>>> dm-2     5792.76  32242.66    45427.12    10817753530  15241278360
>>> sdb      206.09   22789.19    33569.27    7646015080   11262843224
>>>
>>> On Tue, Aug 5, 2014 at 12:13 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>
>>>> nodetool status:
>>>>
>>>> Datacenter: datacenter1
>>>> =======================
>>>> Status=Up/Down
>>>> |/ State=Normal/Leaving/Joining/Moving
>>>> --  Address      Load     Tokens  Owns (effective)  Host ID                               Rack
>>>> UN  10.10.20.27  1.89 TB  256     25.4%             76023cdd-c42d-4068-8b53-ae94584b8b04  rack1
>>>> UN  10.10.20.62  1.83 TB  256     25.5%             84b47313-da75-4519-94f3-3951d554a3e5  rack1
>>>> UN  10.10.20.47  1.87 TB  256     24.7%             bcd51a92-3150-41ae-9c51-104ea154f6fa  rack1
>>>> UN  10.10.20.45  1.7 TB   256     22.6%             8d6bce33-8179-4660-8443-2cf822074ca4  rack1
>>>> UN  10.10.20.15  1.86 TB  256     24.5%             01a01f07-4df2-4c87-98e9-8dd38b3e4aee  rack1
>>>> UN  10.10.20.31  1.87 TB  256     24.9%             1435acf9-c64d-4bcd-b6a4-abcec209815e  rack1
>>>> UN  10.10.20.35  1.86 TB  256     25.8%             17cb8772-2444-46ff-8525-33746514727d  rack1
>>>> UN  10.10.20.51  1.89 TB  256     25.0%             0343cd58-3686-465f-8280-56fb72d161e2  rack1
>>>> UN  10.10.20.19  1.91 TB  256     25.5%             30ddf003-4d59-4a3e-85fa-e94e4adba1cb  rack1
>>>> UN  10.10.20.39  1.93 TB  256     26.0%             b7d44c26-4d75-4d36-a779-b7e7bdaecbc9  rack1
>>>> UN  10.10.20.52  1.81 TB  256     25.4%             6b5aca07-1b14-4bc2-a7ba-96f026fa0e4e  rack1
>>>> UN  10.10.20.22  1.89 TB  256     24.8%             46af9664-8975-4c91-847f-3f7b8f8d5ce2  rack1
>>>>
>>>> Note: the new node is not part of the above list.
>>>>
>>>> nodetool compactionstats:
>>>>
>>>> pending tasks: 1649
>>>>  compaction type  keyspace        column family          completed   total         unit   progress
>>>>       Compaction  iprod           customerorder          1682804084  17956558077   bytes  9.37%
>>>>       Compaction  prodgate        customerorder          1664239271  1693502275    bytes  98.27%
>>>>       Compaction  qa_config_bkup  fixsessionconfig_hist  2443        27253         bytes  8.96%
>>>>       Compaction  prodgate        customerorder_hist     1770577280  5026699390    bytes  35.22%
>>>>       Compaction  iprodgate       customerorder_hist     2959560205  312350192622  bytes  0.95%
>>>>
>>>> On Tue, Aug 5, 2014 at 11:37 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>>
>>>>>> Yes num_tokens is set to 256. initial_token is blank on all nodes including the new one.
>>>>>
>>>>> Ok, so you have num_tokens set to 256 for all nodes with initial_token commented out. This means you are using vnodes, and the new node will automatically grab a list of tokens to take over responsibility for.
>>>>>
>>>>>> Pool Name    Active  Pending  Completed  Blocked  All time blocked
>>>>>> FlushWriter  0       0        1136       0        512
>>>>>>
>>>>>> Looks like about 50% of flushes are blocked.
>>>>>
>>>>> This is a problem, as it indicates that the IO system cannot keep up.
>>>>>
>>>>>> Just ran this on the new node:
>>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>>> 10
>>>>>
>>>>> This is normal, as the new node will most likely take tokens from all nodes in the cluster.
>>>>>
>>>>>> Sorry for the multiple updates, but another thing I found was that all the other existing nodes have themselves in the seeds list, but the new node does not have itself in its seeds list. Can that cause this issue?
>>>>>
>>>>> Seeds are only used when a new node is bootstrapping into the cluster and needs a set of IPs to contact and discover the cluster, so this would have no impact on data sizes or streaming. In general it would be considered best practice to have a set of 2-3 seeds from each data center, with all nodes having the same seed list.
>>>>>
>>>>> What is the current output of 'nodetool compactionstats'? Could you also paste the output of 'nodetool status <keyspace>'?
>>>>>
>>>>> Mark
>>>>>
>>>>> On Tue, Aug 5, 2014 at 3:59 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>
>>>>>> Sorry for the multiple updates, but another thing I found was that all the other existing nodes have themselves in the seeds list, but the new node does not have itself in its seeds list. Can that cause this issue?
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 10:30 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>
>>>>>>> Just ran this on the new node:
>>>>>>>
>>>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>>>> 10
>>>>>>>
>>>>>>> Seems like the new node is receiving data from 10 other nodes. Is that expected in a vnodes-enabled environment?
>>>>>>>
>>>>>>> Ruchir.
>>>>>>>
>>>>>>> On Tue, Aug 5, 2014 at 10:21 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Also, not sure if this is relevant, but I just noticed the nodetool tpstats output:
>>>>>>>>
>>>>>>>> Pool Name    Active  Pending  Completed  Blocked  All time blocked
>>>>>>>> FlushWriter  0       0        1136       0        512
>>>>>>>>
>>>>>>>> Looks like about 50% of flushes are blocked.
>>>>>>>>
>>>>>>>> On Tue, Aug 5, 2014 at 10:14 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, num_tokens is set to 256. initial_token is blank on all nodes, including the new one.
>>>>>>>>>
>>>>>>>>> On Tue, Aug 5, 2014 at 10:03 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>>>>>>>
>>>>>>>>>>> My understanding was that if initial_token is left empty on the new node, it just contacts the heaviest node and bisects its token range.
>>>>>>>>>>
>>>>>>>>>> If you are using vnodes and you have num_tokens set to 256, the new node will take token ranges dynamically. What is the configuration of your other nodes? Are you setting num_tokens or initial_token on those?
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Patricia for your response!
>>>>>>>>>>>
>>>>>>>>>>> On the new node, I just see a lot of the following:
>>>>>>>>>>>
>>>>>>>>>>> INFO [FlushWriter:75] 2014-08-05 09:53:04,394 Memtable.java (line 400) Writing Memtable
>>>>>>>>>>> INFO [CompactionExecutor:3] 2014-08-05 09:53:11,132 CompactionTask.java (line 262) Compacted 12 sstables to
>>>>>>>>>>>
>>>>>>>>>>> So basically it is just busy flushing and compacting. Would you have any idea why the disk usage has blown up 2x?
>>>>>>>>>>> My understanding was that if initial_token is left empty on the new node, it just contacts the heaviest node and bisects its token range. The heaviest node is around 2.1 TB, and the new node is already at 4 TB. Could this be because compaction is falling behind?
>>>>>>>>>>>
>>>>>>>>>>> Ruchir
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 4, 2014 at 7:23 PM, Patricia Gorla <patri...@thelastpickle.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ruchir,
>>>>>>>>>>>>
>>>>>>>>>>>> What exactly are you seeing in the logs? Are you running major compactions on the new bootstrapping node?
>>>>>>>>>>>>
>>>>>>>>>>>> With respect to the seed list, it is generally advisable to use 3 seed nodes per AZ / DC.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 4, 2014 at 11:41 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am trying to bootstrap the thirteenth node into a 12-node cluster where the average data size per node is about 2.1 TB. The bootstrap streaming has been going on for 2 days now, and the disk usage on the new node is already above 4 TB and still growing. Is this because the new node is running major compactions while the streaming is going on?
>>>>>>>>>>>>>
>>>>>>>>>>>>> One thing I noticed that seemed off: the seeds property in the yaml of the 13th node comprises nodes 1 through 12, whereas the seeds property on each of the existing 12 nodes consists of all the other nodes except the thirteenth. Is this an issue?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any other insight is appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ruchir.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Patricia Gorla
>>>>>>>>>>>> @patriciagorla
>>>>>>>>>>>>
>>>>>>>>>>>> Consultant
>>>>>>>>>>>> Apache Cassandra Consulting
>>>>>>>>>>>> http://www.thelastpickle.com
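A note on the tpstats figures quoted above: the FlushWriter pool shows 512 flushes blocked at some point against 1136 completed, so 512 / (1136 + 512) is roughly 31% of all flush attempts blocked (512 / 1136 is roughly 45% relative to completed flushes, which is where the "about 50%" in the thread comes from). Either reading supports the same conclusion Mark draws: the IO system is not keeping up with flush activity.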
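The settings discussed throughout the thread all live in cassandra.yaml. A minimal sketch for inspecting them on each node (the config path and the sample seed list are assumptions for illustration, not confirmed details of this cluster):

    # Print the knobs discussed above (adjust the path to your install):
    grep -nE '^#? ?(num_tokens|initial_token|memtable_flush_writers|compaction_throughput_mb_per_sec)' /etc/cassandra/conf/cassandra.yaml
    grep -n -e '- seeds:' /etc/cassandra/conf/cassandra.yaml
    # Per the thread, this cluster runs with:
    #   num_tokens: 256                        (vnodes enabled)
    #   # initial_token:                       (left blank on every node)
    #   memtable_flush_writers: 6
    #   compaction_throughput_mb_per_sec: 0    (0 disables compaction throttling)
    # The advice given is a short, identical seed list on every node, e.g.
    #   - seeds: "10.10.20.27,10.10.20.62,10.10.20.47"   (2-3 per DC)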
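And a rough sketch of a watch loop for the question of whether compaction on the bootstrapping node is falling behind while streaming runs (the interval and the data directory are assumptions):

    while true; do
      date
      nodetool compactionstats | grep 'pending tasks'   # backlog; should fall once streaming ends
      nodetool netstats | grep -c 'Streaming from'      # nodes still streaming to this one
      df -h /var/lib/cassandra/data                     # on-disk footprint of the new node
      sleep 300
    done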