Scenarios when a node can be missing writes

Hannu Kröger Tue, 22 Mar 2016 04:32:56 -0700

Hi,

I'm trying to reason about the possible scenarios in which a node of a C*
cluster does not get the writes and the data needs some sort of anti-entropy
(repair, read-repair, etc.). In what cases does the coordinator not realize
that a write failed, and so not replay the write from the hinted handoff table?


1) The obvious case: A node is down and doesn't recover before the hinted
handoff window (max_hint_window_in_ms) has passed, or hinted handoff is
disabled altogether. In this case the node will miss data and repair is needed.

2) Another obvious one: disk / filesystem problems. Repair is needed.

3) Node is up and receives the write but is too overloaded to handle it and
drops the mutation. This should be visible in "nodetool tpstats" as a dropped
mutation. Does the write still stay in the hinted handoff table of the
coordinator, and if so, when is it replayed if the node is seemingly up all
the time? Or is it assumed that if dropped mutations > 0 then repair is needed?

4) Node receives the write but goes down while writing it to disk. Either the
write should be in the commit log, or the coordinator does not receive an OK
for it. There is a small window (10 s) in which an OK has been given but the
data is not yet synced to disk, if "commitlog_sync" is "periodic" (which it is
by default) and "commitlog_sync_period_in_ms" is 10000 (10 seconds). Can this
cause a node to miss writes if the server has stayed up the whole time and
only Cassandra has restarted?
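
To make the window in case 4 concrete, here is a rough sketch of the timing I have in mind. It is a toy model only, not Cassandra's commit log: the names Write and writes_at_risk are made up for illustration, and it assumes syncs happen at exact multiples of the sync period. It only lists which acknowledged writes fall after the last sync before a crash; whether those writes are actually lost when only the Cassandra process restarts is exactly the open question.

```python
from dataclasses import dataclass


@dataclass
class Write:
    key: str
    acked_at_ms: int  # time at which the replica acknowledged the write


def writes_at_risk(writes, crash_at_ms, sync_period_ms=10_000):
    """Return acked writes that fall after the last periodic sync before the crash.

    Assumption (hypothetical model): syncs occur at exact multiples of
    sync_period_ms, starting at time 0.
    """
    last_sync_ms = (crash_at_ms // sync_period_ms) * sync_period_ms
    return [w for w in writes if last_sync_ms < w.acked_at_ms <= crash_at_ms]


writes = [
    Write("a", 9_500),   # acked before the sync at 10 000 ms
    Write("b", 12_000),  # acked after that sync, before the crash
    Write("c", 19_999),  # acked right at the crash time
    Write("d", 25_000),  # arrives after the crash, never acked by this node
]

at_risk = writes_at_risk(writes, crash_at_ms=19_999)
print([w.key for w in at_risk])  # prints ['b', 'c']
```

In this model a crash just before a sync exposes almost a full period of acknowledged writes, while a crash right after a sync exposes none.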

Any other scenarios?

Cheers,
Hannu Kröger
