Hello.

I'm stress-testing our Cassandra (version 1.0.9) cluster and tried turning
off two of the four nodes for half an hour under heavy load. As a result I
got a large volume of hints on the surviving nodes - HintsColumnFamily takes
about 1.5 GB of disk space on each of them. It seems these hints are never
replayed successfully.

After I bring the other nodes back online, tpstats shows active handoffs,
but I can't see any writes arriving on the target nodes.
The log indicates memory pressure - the heap is >80% full (heap size is
8 GB total, 1 GB young generation).
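
For reference, the heap is configured in conf/cassandra-env.sh with
something like:
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="1G"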

A fragment of the log:
 INFO 18:34:05,513 Started hinted handoff for token: 1 with IP: /84.201.162.144
 INFO 18:34:06,794 GC for ParNew: 300 ms for 1 collections, 5974181760 used; max is 8588951552
 INFO 18:34:07,795 GC for ParNew: 263 ms for 1 collections, 6226018744 used; max is 8588951552
 INFO 18:34:08,795 GC for ParNew: 256 ms for 1 collections, 6559918392 used; max is 8588951552
 INFO 18:34:09,796 GC for ParNew: 231 ms for 1 collections, 6846133712 used; max is 8588951552
 WARN 18:34:09,805 Heap is 0.7978131149667941 full.  You may need to reduce memtable and/or cache sizes.  Cassandra will now flush up to the two largest memtables to free up memory.
 WARN 18:34:09,805 Flushing CFS(Keyspace='test', ColumnFamily='t2') to relieve memory pressure
 INFO 18:34:09,806 Enqueuing flush of Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
 INFO 18:34:09,807 Writing Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
 INFO 18:34:11,018 GC for ParNew: 449 ms for 2 collections, 6573394480 used; max is 8588951552
 INFO 18:34:12,019 GC for ParNew: 265 ms for 1 collections, 6820930056 used; max is 8588951552
 INFO 18:34:13,112 GC for ParNew: 331 ms for 1 collections, 6900566728 used; max is 8588951552
 INFO 18:34:14,181 GC for ParNew: 269 ms for 1 collections, 7101358936 used; max is 8588951552
 INFO 18:34:14,691 Completed flushing /mnt/raid/cassandra/data/test/t2-hc-244-Data.db (56156246 bytes)
 INFO 18:34:15,381 GC for ParNew: 280 ms for 1 collections, 7268441248 used; max is 8588951552
 INFO 18:34:35,306 InetAddress /84.201.162.144 is now dead.
 INFO 18:34:35,306 GC for ConcurrentMarkSweep: 19223 ms for 1 collections, 3774714808 used; max is 8588951552
 INFO 18:34:35,309 InetAddress /84.201.162.144 is now UP

After taking off the load and restarting the service, I still see pending
handoffs:
$ nodetool -h localhost tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0        1004257         0                 0
RequestResponseStage              0         0          92555         0                 0
MutationStage                     0         0              6         0                 0
ReadRepairStage                   0         0          57773         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         143332         0                 0
AntiEntropyStage                  0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
MemtablePostFlusher               0         0              2         0                 0
StreamStage                       0         0              0         0                 0
FlushWriter                       0         0              2         0                 0
MiscStage                         0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
HintedHandoff                     1         3             15         0                 0

These 3 handoffs remain pending for a long time (>12 hours).
Most of the time Cassandra uses 100% of one CPU core; the stack trace of
the busy thread is:
"HintedHandoff:1" daemon prio=10 tid=0x0000000001220800 nid=0x3843 runnable [0x00007fa1e1146000]
   java.lang.Thread.State: RUNNABLE
        at java.util.ArrayList$Itr.remove(ArrayList.java:808)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedSuper(ColumnFamilyStore.java:908)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedColumnsOnly(ColumnFamilyStore.java:857)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeleted(ColumnFamilyStore.java:850)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1195)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1150)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:324)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:256)
        at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:84)
        at org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:437)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

Heap usage is also rather high, even though the node isn't doing anything
except the HH processing. Here is the CMS output:
2013-05-20T22:22:59.812+0400: 4672.075: [GC[YG occupancy: 70070 K (943744 K)]4672.075: [Rescan (parallel) , 0.0224060 secs]4672.098: [weak refs processing, 0.0002900 secs]4672.098: [scrub string table, 0.0002670 secs] [1 CMS-remark: 5523830K(7340032K)] 5593901K(8283776K), 0.0231160 secs] [Times: user=0.28 sys=0.00, real=0.02 secs]

Eventually, after a few service restarts, the hints suddenly disappear.
Probably the TTL expires and the hints get compacted away.


Currently my best guess is the following. Hints are stored in
HintsColumnFamily as supercolumns, with one row per target node. On replay
the service tries to read the whole row into memory at once and fails,
because it is too large to fit in the heap. Then the TTL expires, and the
service starts deleting the outdated subcolumns during the read. Since the
row is backed by a huge ArrayList and every removal shifts the remaining
elements, deleting most of the row one element at a time is effectively
O(n^2), which is why it takes forever (and matches the removeDeletedSuper /
ArrayList$Itr.remove hotspot above).
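
To illustrate why I suspect the ArrayList removal pattern, here is a tiny
standalone sketch. This is my own toy code, not Cassandra's - it only
reproduces the Iterator.remove() pattern from the stack trace above, and
the class name, list size and 90% "expiry" ratio are made up:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ArrayListRemovalDemo
{
    public static void main(String[] args)
    {
        // Pretend this is a single hint row with 200k subcolumns
        // (the real rows here are much larger).
        List<Integer> columns = new ArrayList<Integer>();
        for (int i = 0; i < 200000; i++)
            columns.add(i);

        long start = System.nanoTime();
        // Drop ~90% of the entries one by one, as if their TTL had expired.
        // Every Iterator.remove() shifts the whole tail of the backing array,
        // so removing most of an n-element ArrayList this way is O(n^2) -
        // already many seconds at 200k elements, far worse at hint-row sizes.
        for (Iterator<Integer> it = columns.iterator(); it.hasNext(); )
        {
            if (it.next() % 10 != 0)
                it.remove();
        }
        System.out.printf("kept %d entries in %.1f s%n",
                          columns.size(), (System.nanoTime() - start) / 1e9);
    }
}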

So, it seems there are two problems here:
1) Hints are not paged correctly and cause significant memory pressure -
which is strange, since this very issue was supposedly addressed in
https://issues.apache.org/jira/browse/CASSANDRA-1327 and
https://issues.apache.org/jira/browse/CASSANDRA-3624;
2) Deletion of outdated hints doesn't scale to large hint volumes.


Any suggestions on how to make the cluster more tolerant of node downtime?

If I turn off hinted handoff entirely and manually run a repair after a
downtime, will it restore all the data correctly?
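
For reference, here is roughly what I have in mind - assuming the option
name and the repair syntax are unchanged in 1.0.9:

# cassandra.yaml on every node
hinted_handoff_enabled: false

# after a downed node comes back, run on that node
# ('test' is the keyspace from the logs above):
$ nodetool -h localhost repair test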

--
Best regards, Vladimir
