For some reason the 1.0.7 hints actually use a super column :)
On Thu, May 23, 2013 at 6:18 PM, aaron morton <aa...@thelastpickle.com> wrote:

> I know how this sounds, but upgrading to 1.1.11 is the best approach.
> 1.0.x is not getting any fixes, 1.1.x is the most stable and still getting
> some patches, and 1.2 is stable and in use.
>
> Hint storage has been redesigned in 1.2.
>
> Any suggestions on how to make the cluster more tolerant to downtimes?
>
> Hints are always seen as an optimisation; their success or otherwise does
> not impact the consistency guarantees.
>
> If you are dealing with very high throughput, as a workaround you can
> reduce the time that hints are stored for a down node; see the yaml file
> for info.
>
> The behaviour changes if you have lots of small or large columns. This is
> from the HintedHandOffManager, which selects the page size:
>
> int pageSize = PAGE_SIZE;
> // read less columns (mutations) per page if they are very large
> if (hintStore.getMeanColumns() > 0)
> {
>     int averageColumnSize = (int) (hintStore.getMeanRowSize() / hintStore.getMeanColumns());
>     pageSize = Math.min(PAGE_SIZE, DatabaseDescriptor.getInMemoryCompactionLimit() / averageColumnSize);
>     pageSize = Math.max(2, pageSize); // page size of 1 does not allow actual paging b/c of >= behavior on startColumn
>     logger_.debug("average hinted-row column size is {}; using pageSize of {}", averageColumnSize, pageSize);
> }
>
> If you reduce the in_memory_compaction_limit yaml setting, that would
> reduce the page size.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
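To make the paging maths above concrete, here is a toy standalone version with made-up numbers (the PAGE_SIZE and compaction-limit values below are assumptions for illustration, not necessarily the 1.0.x defaults):

    // Toy illustration of the hint paging maths quoted above.
    // PAGE_SIZE and IN_MEMORY_COMPACTION_LIMIT are assumed values.
    public class HintPageSizeExample
    {
        private static final int PAGE_SIZE = 512;
        private static final long IN_MEMORY_COMPACTION_LIMIT = 64L * 1024 * 1024; // 64 MB

        public static void main(String[] args)
        {
            long meanRowSize = 1536L * 1024 * 1024; // a ~1.5 GB hint row, as in this thread
            long meanColumns = 1536;                // suppose ~1536 hinted mutations in it
            int averageColumnSize = (int) (meanRowSize / meanColumns); // 1 MB per mutation
            int pageSize = (int) Math.min(PAGE_SIZE, IN_MEMORY_COMPACTION_LIMIT / averageColumnSize);
            pageSize = Math.max(2, pageSize);
            // min(512, 64 MB / 1 MB) = 64 mutations per page, i.e. ~64 MB read per page;
            // halving in_memory_compaction_limit would halve the page size too.
            System.out.println("pageSize = " + pageSize);
        }
    }

So for large hinted mutations the page size is effectively in_memory_compaction_limit divided by the average mutation size, which is why Aaron suggests lowering that limit to shrink how much a single page pulls into the heap.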
> On 21/05/2013, at 9:26 PM, Vladimir Volkov <vlad.vol...@gmail.com> wrote:
>
> Hello.
>
> I'm stress-testing our Cassandra (version 1.0.9) cluster, and tried
> turning off two of the four nodes for half an hour under heavy load. As a
> result I got a large volume of hints on the live nodes - HintsColumnFamily
> takes about 1.5 GB of disk space on each of them. It seems these hints are
> never replayed successfully.
>
> After I bring the other nodes back online, tpstats shows active handoffs,
> but I can't see any writes on the target nodes.
> The log indicates memory pressure - the heap is >80% full (heap size is
> 8 GB total, 1 GB young).
>
> A fragment of the log:
>
>  INFO 18:34:05,513 Started hinted handoff for token: 1 with IP: /84.201.162.144
>  INFO 18:34:06,794 GC for ParNew: 300 ms for 1 collections, 5974181760 used; max is 8588951552
>  INFO 18:34:07,795 GC for ParNew: 263 ms for 1 collections, 6226018744 used; max is 8588951552
>  INFO 18:34:08,795 GC for ParNew: 256 ms for 1 collections, 6559918392 used; max is 8588951552
>  INFO 18:34:09,796 GC for ParNew: 231 ms for 1 collections, 6846133712 used; max is 8588951552
>  WARN 18:34:09,805 Heap is 0.7978131149667941 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory.
>  WARN 18:34:09,805 Flushing CFS(Keyspace='test', ColumnFamily='t2') to relieve memory pressure
>  INFO 18:34:09,806 Enqueuing flush of Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
>  INFO 18:34:09,807 Writing Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
>  INFO 18:34:11,018 GC for ParNew: 449 ms for 2 collections, 6573394480 used; max is 8588951552
>  INFO 18:34:12,019 GC for ParNew: 265 ms for 1 collections, 6820930056 used; max is 8588951552
>  INFO 18:34:13,112 GC for ParNew: 331 ms for 1 collections, 6900566728 used; max is 8588951552
>  INFO 18:34:14,181 GC for ParNew: 269 ms for 1 collections, 7101358936 used; max is 8588951552
>  INFO 18:34:14,691 Completed flushing /mnt/raid/cassandra/data/test/t2-hc-244-Data.db (56156246 bytes)
>  INFO 18:34:15,381 GC for ParNew: 280 ms for 1 collections, 7268441248 used; max is 8588951552
>  INFO 18:34:35,306 InetAddress /84.201.162.144 is now dead.
>  INFO 18:34:35,306 GC for ConcurrentMarkSweep: 19223 ms for 1 collections, 3774714808 used; max is 8588951552
>  INFO 18:34:35,309 InetAddress /84.201.162.144 is now UP
>
> After taking off the load and restarting the service, I still see pending
> handoffs:
>
> $ nodetool -h localhost tpstats
> Pool Name                    Active   Pending   Completed   Blocked   All time blocked
> ReadStage                         0         0     1004257         0                  0
> RequestResponseStage              0         0       92555         0                  0
> MutationStage                     0         0           6         0                  0
> ReadRepairStage                   0         0       57773         0                  0
> ReplicateOnWriteStage             0         0           0         0                  0
> GossipStage                       0         0      143332         0                  0
> AntiEntropyStage                  0         0           0         0                  0
> MigrationStage                    0         0           0         0                  0
> MemtablePostFlusher               0         0           2         0                  0
> StreamStage                       0         0           0         0                  0
> FlushWriter                       0         0           2         0                  0
> MiscStage                         0         0           0         0                  0
> InternalResponseStage             0         0           0         0                  0
> HintedHandoff                     1         3          15         0                  0
>
> These 3 handoffs remain pending for a long time (>12 hours).
> Most of the time Cassandra uses 100% of one CPU core; the stack trace of
> the busy thread is:
>
> "HintedHandoff:1" daemon prio=10 tid=0x0000000001220800 nid=0x3843 runnable [0x00007fa1e1146000]
>    java.lang.Thread.State: RUNNABLE
>     at java.util.ArrayList$Itr.remove(ArrayList.java:808)
>     at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedSuper(ColumnFamilyStore.java:908)
>     at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedColumnsOnly(ColumnFamilyStore.java:857)
>     at org.apache.cassandra.db.ColumnFamilyStore.removeDeleted(ColumnFamilyStore.java:850)
>     at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1195)
>     at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1150)
>     at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:324)
>     at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:256)
>     at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:84)
>     at org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:437)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
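The busy frame here (ArrayList$Itr.remove inside removeDeletedSuper) is consistent with Vladimir's guess further down: removing expired subcolumns one at a time from an ArrayList shifts the remaining tail of the backing array on every call, so purging a huge hint row costs roughly O(n^2). A self-contained sketch of the effect (sizes are made up, nothing Cassandra-specific):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Demonstrates why per-element Iterator.remove() on a large ArrayList
    // is quadratic: each removal arraycopies the tail left by one slot.
    public class IteratorRemoveCost
    {
        public static void main(String[] args)
        {
            List<Integer> columns = new ArrayList<Integer>();
            for (int i = 0; i < 100000; i++)
                columns.add(i);

            long start = System.nanoTime();
            for (Iterator<Integer> it = columns.iterator(); it.hasNext(); )
            {
                it.next();
                it.remove(); // O(n) tail copy per call -> O(n^2) overall
            }
            System.out.println("removed 100000 elements one by one in "
                               + (System.nanoTime() - start) / 1000000 + " ms");
        }
    }

With hundreds of thousands of hinted mutations in one row, that quadratic pass alone can keep a core pegged for a very long time, which matches the symptom above.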
> Heap usage is also rather high, though the node isn't doing anything
> except the HH processing. Here is the CMS output:
>
> 2013-05-20T22:22:59.812+0400: 4672.075: [GC[YG occupancy: 70070 K (943744 K)]4672.075: [Rescan (parallel) , 0.0224060 secs]4672.098: [weak refs processing, 0.0002900 secs]4672.098: [scrub string table, 0.0002670 secs] [1 CMS-remark: 5523830K(7340032K)] 5593901K(8283776K), 0.0231160 secs] [Times: user=0.28 sys=0.00, real=0.02 secs]
>
> Eventually, after a few service restarts, the hints suddenly disappear.
> Probably the TTL expires and the hints get compacted away.
>
> Currently my best guess is the following. Hinted handoffs are stored as
> supercolumns, with one row per target node. The service tries to read them
> entirely into memory for replay and fails, because the volume is too large
> to fit in the heap at once.
> Then the TTL expires, and the service starts to delete old subcolumns
> during reads. Since the underlying storage is a huge ArrayList, the
> deletion is inefficient and takes forever.
>
> So it seems there are two problems here:
> 1) Hints are not paged correctly and cause significant memory pressure -
> that's actually strange, since the same issue was supposedly addressed in
> https://issues.apache.org/jira/browse/CASSANDRA-1327 and
> https://issues.apache.org/jira/browse/CASSANDRA-3624;
> 2) Deletion of outdated hints doesn't work well for large hint volumes.
>
> Any suggestions on how to make the cluster more tolerant to downtimes?
>
> If I turn off hinted handoff entirely and manually run a repair after a
> downtime, will it restore all the data correctly?
>
> --
> Best regards, Vladimir
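For reference, the knobs discussed in this thread all live in cassandra.yaml. A sketch of the relevant entries (names as I remember them from the 1.0.x sample file, values illustrative; verify against your own version's file before changing anything):

    # cassandra.yaml -- 1.0.x-era names; check your version's sample file
    hinted_handoff_enabled: true          # false turns hint recording off entirely
    max_hint_window_in_ms: 3600000        # stop recording hints for a node down longer than this
    in_memory_compaction_limit_in_mb: 64  # lowering this also shrinks the hinted-handoff page size (see the code above)

Disabling hints only affects the window in which a recovering node catches up automatically; as Aaron notes, hints are an optimisation, so consistency after a downtime still comes from running repair (and reads at quorum), not from hint replay.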