On Fri, May 27, 2011 at 9:08 AM, Jonathan Colby <jonathan.co...@gmail.com> wrote:
> Hi -
>
> Operations like repair and bootstrap on nodes in our cluster (average
> load 150GB each) take a very long time.
>
> By long I mean 1-2 days. With nodetool "netstats" I can see the
> progress % very slowly progressing.
>
> I guess there are some throttling mechanisms built into cassandra.
> And yes there is also production load on these nodes so it is somewhat
> understandable. Also some of our compacted data files are as large as
> 50-60 GB each.
>
> I was just wondering if these times are similar to what other people
> are experiencing or if there is a serious configuration problem with
> our setup.
>
> So what have you guys seen with operations like loadbalance, repair,
> cleanup, bootstrap on nodes with large amounts of data??
>
> I'm not seeing too many full garbage collections. Other minor GCs are
> well under a second.
>
> Setup info:
> 0.7.4
> 5 GB heap
> 8 GB ram
> 64 bit linux os
> AMD quad core HP blades
> CMS Garbage collector with default cassandra settings
> 1 TB raid 0 sata disks
> across 2 datacenters, but operations within the same dc take very long too.
>
> This is a netstats output of a bootstrap that has been going on for 3+
> hours:
>
> Mode: Normal
> Streaming to: /10.47.108.103
>    /var/lib/cassandra/data/DFS/main-f-1541-Data.db/(0,32842490722),(32842490722,139556639427),(139556639427,161075890783)
>        progress=94624588642/161075890783 - 58%
>    /var/lib/cassandra/data/DFS/main-f-1455-Data.db/(0,660743002)
>        progress=0/660743002 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1444-Data.db/(0,32816130132),(32816130132,71465138397),(71465138397,90968640033)
>        progress=0/90968640033 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1540-Data.db/(0,931632934),(931632934,2621052149),(2621052149,3236107041)
>        progress=0/3236107041 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1488-Data.db/(0,33428780851),(33428780851,110546591227),(110546591227,110851587206)
>        progress=0/110851587206 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1542-Data.db/(0,24091168),(24091168,97485080),(97485080,108233211)
>        progress=0/108233211 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1544-Data.db/(0,3646406),(3646406,18065308),(18065308,25776551)
>        progress=0/25776551 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1452-Data.db/(0,676616940)
>        progress=0/676616940 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1548-Data.db/(0,6957269),(6957269,48966550),(48966550,51499779)
>        progress=0/51499779 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1552-Data.db/(0,237153399),(237153399,750466875),(750466875,898056853)
>        progress=0/898056853 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1554-Data.db/(0,45155582),(45155582,195640768),(195640768,247592141)
>        progress=0/247592141 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1449-Data.db/(0,2812483216)
>        progress=0/2812483216 - 0%
>    /var/lib/cassandra/data/DFS/main-f-1545-Data.db/(0,107648943),(107648943,434575065),(434575065,436667186)
>        progress=0/436667186 - 0%
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0         134283
> Responses                       n/a         0         192438

That is a little long, but every case is different. With low request load
and some heavy server iron (RAID, RAM) you can see a compaction move really
fast - 300 GB in 4-6 hours. With enough load, one of these operations
(compact, cleanup, join) can get really bogged down to the point where it
almost does not move. Sometimes that is just the way it is, based on how
fragmented your rows are and how fast your gear is. Not pushing your
Cassandra caches up to your JVM limit can help.
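For example, if the row cache on that column family is set high, you can
dial it back from cassandra-cli. This is just a sketch - I am guessing the
keyspace/CF names (DFS/main) from your data file paths, and the attribute
names are the 0.7 ones, so check them against your schema before running it:

    $ cassandra-cli -h localhost -p 9160
    [default@unknown] use DFS;
    [default@DFS] show keyspaces;       <- look at the row/key cache settings for "main"
    [default@DFS] update column family main with rows_cached=0 and keys_cached=200000;

A smaller row cache means more free heap while a long compaction or stream
is running, and the OS page cache picks up some of the slack.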
If your heap is often near full you can get JVM memory fragmentation, which
slows things down. 0.8 has some more tuning options for compaction:
multi-threaded compaction and knobs for the effective rate. I notice you are
running a 5 GB heap on 8 GB of RAM, so your RAM-to-data ratio is on the low
side. I think unless you have a good use case for the row cache, less Xmx is
more, but that is a minor tweak.
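For reference, these are the sort of knobs I mean. The names are from memory
against the 0.8 sample cassandra.yaml and conf/cassandra-env.sh, so treat
this as a sketch and verify against the comments in the shipped config files:

    # cassandra.yaml (0.8) - compaction throttling / parallelism
    compaction_throughput_mb_per_sec: 16   # raise to finish compactions faster, lower to protect live traffic
    # concurrent_compactors: 2             # multi-threaded compaction, if your 0.8 build exposes it

    # conf/cassandra-env.sh - trim the heap on an 8 GB box (illustrative sizes)
    MAX_HEAP_SIZE="4G"
    HEAP_NEWSIZE="400M"

The idea behind the smaller heap is to leave more of the 8 GB to the OS page
cache instead of the JVM, which usually pays off more than a bigger heap
unless the row cache is really earning its keep.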