We are running v2.0.11 on a 24-node cluster and have seen several instances where a node becomes unresponsive. When we look into it, we find a Cassandra process chewing up a lot of CPU. There is no other indication in the logs of what might be happening. However, if we strace the process that is chewing up CPU, we see a segmentation fault:

--- SIGSEGV (Segmentation fault) @ 0 (0) ---
rt_sigreturn(0x7fd61110f862) = 30618997712
futex(0x7fd614844054, FUTEX_WAIT_PRIVATE, 27333, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fd614844028, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fd6148e2e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd6148e2e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd6148e2e28, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fd614844054, FUTEX_WAIT_PRIVATE, 27335, NULL) = 0
futex(0x7fd614844028, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fd6148e2e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd6148e2e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd6148e2e28, FUTEX_WAKE_PRIVATE, 1) = 1

This sequence repeats over and over for as long as strace is running.

Has anyone seen this? Does anyone have any idea what might be happening, or how we could debug it further?

Thanks for your help,
Stan