As a follow up to this conversation; we are still having issues with our Cassandra cluster on EC2.
It *looks* to be related to Garbage Collection; however we aren't sure what the root cause of the problem is. Here is an extract from logs: INFO [GMFD:1] 2010-09-20 15:22:00,242 Gossiper.java (line 578) InetAddress /10.102.57.197 is now UP INFO [HINTED-HANDOFF-POOL:1] 2010-09-20 15:22:00,242 HintedHandOffManager.java (line 165) Started hinted handoff for endPoint / 10.102.57.197 INFO [HINTED-HANDOFF-POOL:1] 2010-09-20 15:22:00,247 HintedHandOffManager.java (line 222) Finished hinted handoff of 0 rows to endpoint /10.102.57.197 INFO [GC inspection] 2010-09-20 15:27:42,046 GCInspector.java (line 129) GC for ParNew: 325411 ms, 84284896 reclaimed leaving 640770336 used; max is 25907560448 INFO [WRITE-/10.102.7.187] 2010-09-20 15:27:42,052 OutboundTcpConnection.java (line 105) error writing to /10.102.7.187 INFO [GC inspection] 2010-09-20 15:27:42,082 GCInspector.java (line 150) Pool Name Active Pending INFO [GC inspection] 2010-09-20 15:27:42,083 GCInspector.java (line 156) STREAM-STAGE 0 0 INFO [GC inspection] 2010-09-20 15:27:42,083 GCInspector.java (line 156) RESPONSE-STAGE 0 3 INFO [GC inspection] 2010-09-20 15:27:42,083 GCInspector.java (line 156) ROW-READ-STAGE 6 25 INFO [GC inspection] 2010-09-20 15:27:42,084 GCInspector.java (line 156) LB-OPERATIONS 0 0 INFO [GC inspection] 2010-09-20 15:27:42,084 GCInspector.java (line 156) MISCELLANEOUS-POOL 0 0 INFO [GC inspection] 2010-09-20 15:27:42,084 GCInspector.java (line 156) GMFD 1 129 INFO [GC inspection] 2010-09-20 15:27:42,084 GCInspector.java (line 156) CONSISTENCY-MANAGER 0 0 INFO [GC inspection] 2010-09-20 15:27:42,085 GCInspector.java (line 156) LB-TARGET 0 0 INFO [GC inspection] 2010-09-20 15:27:42,085 GCInspector.java (line 156) ROW-MUTATION-STAGE 1 1 INFO [GC inspection] 2010-09-20 15:27:42,085 GCInspector.java (line 156) MESSAGE-STREAMING-POOL 0 0 INFO [GC inspection] 2010-09-20 15:27:42,086 GCInspector.java (line 156) LOAD-BALANCER-STAGE 0 0 INFO [GC inspection] 2010-09-20 15:27:42,086 GCInspector.java (line 156) FLUSH-SORTER-POOL 0 0 INFO [GC inspection] 2010-09-20 15:27:42,086 GCInspector.java (line 156) MEMTABLE-POST-FLUSHER 0 0 INFO [GC inspection] 2010-09-20 15:27:42,086 GCInspector.java (line 156) AE-SERVICE-STAGE 0 0 INFO [GC inspection] 2010-09-20 15:27:42,087 GCInspector.java (line 156) FLUSH-WRITER-POOL 0 0 INFO [GC inspection] 2010-09-20 15:27:42,087 GCInspector.java (line 156) HINTED-HANDOFF-POOL 0 0 INFO [GC inspection] 2010-09-20 15:27:42,087 GCInspector.java (line 161) CompactionManager n/a 0 INFO [GMFD:1] 2010-09-20 15:27:42,088 Gossiper.java (line 592) Node / 10.102.7.187 has restarted, now UP again INFO [GMFD:1] 2010-09-20 15:27:42,089 StorageService.java (line 548) Node / 10.102.7.187 state jump to normal INFO [GMFD:1] 2010-09-20 15:27:42,089 StorageService.java (line 555) Will not change my token ownership to /10.102.7.187 INFO [HINTED-HANDOFF-POOL:1] 2010-09-20 15:27:42,089 HintedHandOffManager.java (line 165) Started hinted handoff for endPoint / 10.102.7.187 INFO [HINTED-HANDOFF-POOL:1] 2010-09-20 15:27:42,112 HintedHandOffManager.java (line 222) Finished hinted handoff of 0 rows to endpoint /10.102.7.187 INFO [WRITE-/10.102.7.187] 2010-09-20 15:27:42,389 OutboundTcpConnection.java (line 105) error writing to /10.102.7.187 INFO [WRITE-/10.102.57.197] 2010-09-20 15:34:15,911 OutboundTcpConnection.java (line 105) error writing to /10.102.57.197 INFO [GC inspection] 2010-09-20 15:34:15,924 GCInspector.java (line 129) GC for ParNew: 372272 ms, 82471240 reclaimed leaving 671616240 used; max is 25907560448 INFO [GC inspection] 2010-09-20 15:34:15,925 GCInspector.java (line 150) Pool Name Active Pending INFO [GC inspection] 2010-09-20 15:34:15,925 GCInspector.java (line 156) STREAM-STAGE 0 0 INFO [GC inspection] 2010-09-20 15:34:15,926 GCInspector.java (line 156) RESPONSE-STAGE 0 21 INFO [Timer-0] 2010-09-20 15:34:15,926 Gossiper.java (line 180) InetAddress /10.102.7.187 is now dead. INFO [GC inspection] 2010-09-20 15:34:15,926 GCInspector.java (line 156) ROW-READ-STAGE 1 85 INFO [GC inspection] 2010-09-20 15:34:15,926 GCInspector.java (line 156) LB-OPERATIONS 0 0 INFO [GC inspection] 2010-09-20 15:34:15,935 GCInspector.java (line 156) MISCELLANEOUS-POOL 0 0 INFO [GC inspection] 2010-09-20 15:34:15,935 GCInspector.java (line 156) GMFD 1 548 INFO [GC inspection] 2010-09-20 15:34:15,935 GCInspector.java (line 156) CONSISTENCY-MANAGER 0 0 INFO [GC inspection] 2010-09-20 15:34:15,936 GCInspector.java (line 156) LB-TARGET 0 0 INFO [GC inspection] 2010-09-20 15:34:15,936 GCInspector.java (line 156) ROW-MUTATION-STAGE 0 0 INFO [GC inspection] 2010-09-20 15:34:15,936 GCInspector.java (line 156) MESSAGE-STREAMING-POOL 0 0 INFO [GC inspection] 2010-09-20 15:34:15,937 GCInspector.java (line 156) LOAD-BALANCER-STAGE 0 0 INFO [GC inspection] 2010-09-20 15:34:15,937 GCInspector.java (line 156) FLUSH-SORTER-POOL 0 0 INFO [GC inspection] 2010-09-20 15:34:15,937 GCInspector.java (line 156) MEMTABLE-POST-FLUSHER 0 0 INFO [GC inspection] 2010-09-20 15:34:15,938 GCInspector.java (line 156) AE-SERVICE-STAGE 0 0 INFO [GC inspection] 2010-09-20 15:34:15,938 GCInspector.java (line 156) FLUSH-WRITER-POOL 0 0 INFO [GC inspection] 2010-09-20 15:34:15,938 GCInspector.java (line 156) HINTED-HANDOFF-POOL 0 0 INFO [GC inspection] 2010-09-20 15:34:15,938 GCInspector.java (line 161) CompactionManager n/a 0 INFO [WRITE-/10.102.57.197] 2010-09-20 15:34:15,931 OutboundTcpConnection.java (line 105) error writing to /10.102.57.197 INFO [GMFD:1] 2010-09-20 15:34:15,960 Gossiper.java (line 578) InetAddress /10.102.7.187 is now UP INFO [GMFD:1] 2010-09-20 15:34:15,961 Gossiper.java (line 592) Node / 10.102.57.197 has restarted, now UP again INFO [GMFD:1] 2010-09-20 15:34:15,961 StorageService.java (line 548) Node / 10.102.57.197 state jump to normal INFO [GMFD:1] 2010-09-20 15:34:15,961 StorageService.java (line 555) Will not change my token ownership to /10.102.57.197 INFO [HINTED-HANDOFF-POOL:1] 2010-09-20 15:34:15,961 HintedHandOffManager.java (line 165) Started hinted handoff for endPoint / 10.102.7.187 Current thoughts - launching a GC for ParNew at 15:27:42, which takes 325411 ms, followed (almost immediately) by another at 15:34:15 which takes 372272 ms is probably sub-optimal. INFO [GC inspection] 2010-09-20 15:27:42,046 GCInspector.java (line 129) GC for ParNew: 325411 ms, 84284896 reclaimed leaving 640770336 used; max is 25907560448 INFO [GC inspection] 2010-09-20 15:34:15,924 GCInspector.java (line 129) GC for ParNew: 372272 ms, 82471240 reclaimed leaving 671616240 used; max is 25907560448 Can anyone help shed any light on why this might be happening? We've tried a variety of JVM settings to alleviate this; currently with no luck. Dave On 17 September 2010 17:04, Jedd Rashbrooke <jedd.rashbro...@imagini.net>wrote: > Hi Rob, > > Thanks for your suggestions. I should have been a bit more verbose > in my platform description -- I'm using 64-bit instances, which I think > in a Ben Black video I saw led to a sensible default usage of mmap > when left at auto. Should I look at forcing this setting? > > > You don't mention which version of the deb package you're using, but : > > I'm using 0.6.5 - the ones bundled by Eric Evans <eev...@apache.org> > > Because these are 32GB machines, I've not configured them with > any swap at all. I've rarely done this in the past - but was aware > there was this swap-hell scenario with JVM's, and the rationale > makes sense -- it's better to have the JVM crash ''cleanly" than > to have it grind the machine to a halt and make it impossible > to get onto the machine to kill the process. > > cheers, > Jedd. >