For some reason the 1.0.7 hints actually use a super column :)
On Thu, May 23, 2013 at 6:18 PM, aaron morton <aa...@thelastpickle.com> wrote:

> I know how this sounds, but upgrading to 1.1.11 is the best approach.
> 1.0.x is not getting any fixes, 1.1.x is the most stable and still getting
> some patches, and 1.2 is stable and in use.
>
> Hint storage has been redesigned in 1.2.
>
> Any suggestions on how to make the cluster more tolerant to downtimes?
>
> Hints are always seen as an optimisation; their success or otherwise does
> not impact the consistency guarantees.
>
> If you are dealing with very high throughput, as a workaround you can
> reduce the time that hints are stored for a down node; see the yaml file
> for info.
>
> The behaviour changes if you have lots of small or large columns. This is
> from the HintedHandOffManager, which selects the page size:
>
> int pageSize = PAGE_SIZE;
> // read less columns (mutations) per page if they are very large
> if (hintStore.getMeanColumns() > 0)
> {
>     int averageColumnSize = (int) (hintStore.getMeanRowSize() / hintStore.getMeanColumns());
>     pageSize = Math.min(PAGE_SIZE, DatabaseDescriptor.getInMemoryCompactionLimit() / averageColumnSize);
>     pageSize = Math.max(2, pageSize); // page size of 1 does not allow actual paging b/c of >= behavior on startColumn
>     logger_.debug("average hinted-row column size is {}; using pageSize of {}", averageColumnSize, pageSize);
> }
>
> If you reduce the in_memory_compaction_limit yaml setting, that would
> reduce the page size.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
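To make the paging maths above concrete, here is a toy standalone version with made-up numbers (the PAGE_SIZE and compaction-limit values below are assumptions for illustration, not necessarily the 1.0.x defaults):

    // Toy illustration of the hint paging maths quoted above.
    // PAGE_SIZE and IN_MEMORY_COMPACTION_LIMIT are assumed values.
    public class HintPageSizeExample
    {
        private static final int PAGE_SIZE = 512;
        private static final long IN_MEMORY_COMPACTION_LIMIT = 64L * 1024 * 1024; // 64 MB

        public static void main(String[] args)
        {
            long meanRowSize = 1536L * 1024 * 1024; // a ~1.5 GB hint row, as in this thread
            long meanColumns = 1536;                // suppose ~1536 hinted mutations in it
            int averageColumnSize = (int) (meanRowSize / meanColumns); // 1 MB per mutation
            int pageSize = (int) Math.min(PAGE_SIZE, IN_MEMORY_COMPACTION_LIMIT / averageColumnSize);
            pageSize = Math.max(2, pageSize);
            // min(512, 64 MB / 1 MB) = 64 mutations per page, i.e. ~64 MB read per page;
            // halving in_memory_compaction_limit would halve the page size too.
            System.out.println("pageSize = " + pageSize);
        }
    }

So for large hinted mutations the page size is effectively in_memory_compaction_limit divided by the average mutation size, which is why Aaron suggests lowering that limit to shrink how much a single page pulls into the heap.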
> On 21/05/2013, at 9:26 PM, Vladimir Volkov <vlad.vol...@gmail.com> wrote:
>
> Hello.
>
> I'm stress-testing our Cassandra (version 1.0.9) cluster, and tried
> turning off two of the four nodes for half an hour under heavy load. As a
> result I got a large volume of hints on the live nodes - HintsColumnFamily
> takes about 1.5 GB of disk space on each of them. It seems these hints are
> never replayed successfully.
>
> After I bring the other nodes back online, tpstats shows active handoffs,
> but I can't see any writes on the target nodes.
> The log indicates memory pressure - the heap is >80% full (heap size is
> 8 GB total, 1 GB young).
>
> A fragment of the log:
>
>  INFO 18:34:05,513 Started hinted handoff for token: 1 with IP: /84.201.162.144
>  INFO 18:34:06,794 GC for ParNew: 300 ms for 1 collections, 5974181760 used; max is 8588951552
>  INFO 18:34:07,795 GC for ParNew: 263 ms for 1 collections, 6226018744 used; max is 8588951552
>  INFO 18:34:08,795 GC for ParNew: 256 ms for 1 collections, 6559918392 used; max is 8588951552
>  INFO 18:34:09,796 GC for ParNew: 231 ms for 1 collections, 6846133712 used; max is 8588951552
>  WARN 18:34:09,805 Heap is 0.7978131149667941 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory.
>  WARN 18:34:09,805 Flushing CFS(Keyspace='test', ColumnFamily='t2') to relieve memory pressure
>  INFO 18:34:09,806 Enqueuing flush of Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
>  INFO 18:34:09,807 Writing Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
>  INFO 18:34:11,018 GC for ParNew: 449 ms for 2 collections, 6573394480 used; max is 8588951552
>  INFO 18:34:12,019 GC for ParNew: 265 ms for 1 collections, 6820930056 used; max is 8588951552
>  INFO 18:34:13,112 GC for ParNew: 331 ms for 1 collections, 6900566728 used; max is 8588951552
>  INFO 18:34:14,181 GC for ParNew: 269 ms for 1 collections, 7101358936 used; max is 8588951552
>  INFO 18:34:14,691 Completed flushing /mnt/raid/cassandra/data/test/t2-hc-244-Data.db (56156246 bytes)
>  INFO 18:34:15,381 GC for ParNew: 280 ms for 1 collections, 7268441248 used; max is 8588951552
>  INFO 18:34:35,306 InetAddress /84.201.162.144 is now dead.
>  INFO 18:34:35,306 GC for ConcurrentMarkSweep: 19223 ms for 1 collections, 3774714808 used; max is 8588951552
>  INFO 18:34:35,309 InetAddress /84.201.162.144 is now UP
>
> After taking off the load and restarting the service, I still see pending
> handoffs:
>
> $ nodetool -h localhost tpstats
> Pool Name                    Active   Pending   Completed   Blocked   All time blocked
> ReadStage                         0         0     1004257         0                  0
> RequestResponseStage              0         0       92555         0                  0
> MutationStage                     0         0           6         0                  0
> ReadRepairStage                   0         0       57773         0                  0
> ReplicateOnWriteStage             0         0           0         0                  0
> GossipStage                       0         0      143332         0                  0
> AntiEntropyStage                  0         0           0         0                  0
> MigrationStage                    0         0           0         0                  0
> MemtablePostFlusher               0         0           2         0                  0
> StreamStage                       0         0           0         0                  0
> FlushWriter                       0         0           2         0                  0
> MiscStage                         0         0           0         0                  0
> InternalResponseStage             0         0           0         0                  0
> HintedHandoff                     1         3          15         0                  0
>
> These 3 handoffs remain pending for a long time (>12 hours).
> Most of the time Cassandra uses 100% of one CPU core; the stack trace of
> the busy thread is:
>
> "HintedHandoff:1" daemon prio=10 tid=0x0000000001220800 nid=0x3843 runnable [0x00007fa1e1146000]
>    java.lang.Thread.State: RUNNABLE
>     at java.util.ArrayList$Itr.remove(ArrayList.java:808)
>     at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedSuper(ColumnFamilyStore.java:908)
>     at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedColumnsOnly(ColumnFamilyStore.java:857)
>     at org.apache.cassandra.db.ColumnFamilyStore.removeDeleted(ColumnFamilyStore.java:850)
>     at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1195)
>     at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1150)
>     at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:324)
>     at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:256)
>     at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:84)
>     at org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:437)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
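The busy frame here (ArrayList$Itr.remove inside removeDeletedSuper) is consistent with Vladimir's guess further down: removing expired subcolumns one at a time from an ArrayList shifts the remaining tail of the backing array on every call, so purging a huge hint row costs roughly O(n^2). A self-contained sketch of the effect (sizes are made up, nothing Cassandra-specific):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Demonstrates why per-element Iterator.remove() on a large ArrayList
    // is quadratic: each removal arraycopies the tail left by one slot.
    public class IteratorRemoveCost
    {
        public static void main(String[] args)
        {
            List<Integer> columns = new ArrayList<Integer>();
            for (int i = 0; i < 100000; i++)
                columns.add(i);

            long start = System.nanoTime();
            for (Iterator<Integer> it = columns.iterator(); it.hasNext(); )
            {
                it.next();
                it.remove(); // O(n) tail copy per call -> O(n^2) overall
            }
            System.out.println("removed 100000 elements one by one in "
                               + (System.nanoTime() - start) / 1000000 + " ms");
        }
    }

With hundreds of thousands of hinted mutations in one row, that quadratic pass alone can keep a core pegged for a very long time, which matches the symptom above.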
> Heap usage is also rather high, though the node isn't doing anything
> except the HH processing. Here is the CMS output:
>
> 2013-05-20T22:22:59.812+0400: 4672.075: [GC[YG occupancy: 70070 K (943744 K)]4672.075: [Rescan (parallel) , 0.0224060 secs]4672.098: [weak refs processing, 0.0002900 secs]4672.098: [scrub string table, 0.0002670 secs] [1 CMS-remark: 5523830K(7340032K)] 5593901K(8283776K), 0.0231160 secs] [Times: user=0.28 sys=0.00, real=0.02 secs]
>
> Eventually, after a few service restarts, the hints suddenly disappear.
> Probably the TTL expires and the hints get compacted away.
>
> Currently my best guess is the following. Hinted handoffs are stored as
> supercolumns, with one row per target node. The service tries to read them
> entirely into memory for replay and fails, because the volume is too large
> to fit in the heap at once.
> Then the TTL expires, and the service starts to delete old subcolumns
> during reads. Since the underlying storage is a huge ArrayList, the
> deletion is inefficient and takes forever.
>
> So it seems there are two problems here:
> 1) Hints are not paged correctly and cause significant memory pressure -
> that's actually strange, since the same issue was supposedly addressed in
> https://issues.apache.org/jira/browse/CASSANDRA-1327 and
> https://issues.apache.org/jira/browse/CASSANDRA-3624;
> 2) Deletion of outdated hints doesn't work well for large hint volumes.
>
> Any suggestions on how to make the cluster more tolerant to downtimes?
>
> If I turn off hinted handoff entirely and manually run a repair after a
> downtime, will it restore all the data correctly?
>
> --
> Best regards, Vladimir
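For reference, the knobs discussed in this thread all live in cassandra.yaml. A sketch of the relevant entries (names as I remember them from the 1.0.x sample file, values illustrative; verify against your own version's file before changing anything):

    # cassandra.yaml -- 1.0.x-era names; check your version's sample file
    hinted_handoff_enabled: true          # false turns hint recording off entirely
    max_hint_window_in_ms: 3600000        # stop recording hints for a node down longer than this
    in_memory_compaction_limit_in_mb: 64  # lowering this also shrinks the hinted-handoff page size (see the code above)

Disabling hints only affects the window in which a recovering node catches up automatically; as Aaron notes, hints are an optimisation, so consistency after a downtime still comes from running repair (and reads at quorum), not from hint replay.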