On Thu, Jun 13, 2019 at 3:16 PM Jeff Jirsa <jji...@gmail.com> wrote:

> On Jun 13, 2019, at 2:52 AM, Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
> On Wed, Jun 12, 2019 at 4:02 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
> To avoid violating consistency guarantees, you have to repair the replicas
>> while the lost node is down
>>
>
> How do you suggest to trigger it?  Potentially replicas of the primary
> range for the down node are all over the local DC, so I would go with
> triggering a full cluster repair with Cassandra Reaper.  But isn't it going
> to fail because of the down node?
>
> I'm not sure there’s an easy and obvious path here - this is something TLP
> may want to enhance Reaper to help with.
>
> You have to specify the ranges with -st/-et, and you have to tell it to
> ignore the down host with -hosts. With vnodes you’re right that this may be
> lots and lots of ranges all over the ring.
>
> There’s a patch proposed (maybe committed in 4.0) that makes this a
> non-issue by allowing bootstrap to stream one repaired set and all of the
> unrepaired replica data (which is probably very small if you’re running IR
> regularly), which accomplishes the same thing.
>

Ouch, it really hurts to learn this. :(
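
For the record, if I understand the suggestion correctly, that means running
something along these lines for every range the dead node replicated (the
token values, host addresses and keyspace below are just placeholders, and
the exact option syntax may differ between versions):

    nodetool repair -st <start_token> -et <end_token> \
        -hosts <live_replica_1> -hosts <live_replica_2> <keyspace>

With 256 vnodes per node that is a lot of invocations to script, which is why
some tooling support here would be very welcome.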

> It is also documented (I believe) that one should repair the node after it
> finishes the "replace address" procedure.  So should one repair before and
> after?
>
> You do not need to repair after the bootstrap if you repair before. If the
> docs say that, they’re wrong. The joining host gets writes during bootstrap
> and consistency levels are altered during bootstrap to account for the
> joining host.
>

This is what I had in mind (what makes replacement different from an actual
bootstrap of a new node):
http://cassandra.apache.org/doc/latest/operating/topo_changes.html?highlight=replace%20address#replacing-a-dead-node


Note

If any of the following cases apply, you MUST run repair to make the replaced
node consistent again, since it missed ongoing writes during/prior to
bootstrapping. The *replacement* timeframe refers to the period from when
the node initially dies to when a new node completes the replacement
process.


   1. The node is down for longer than max_hint_window_in_ms before being
      replaced.
   2. You are replacing using the same IP address as the dead node and
      replacement takes longer than max_hint_window_in_ms.


I would imagine that any production-size instance would take far longer to
replace than the default max hint window (which is 3 hours, AFAIK).  I didn't
remember the same-IP restriction, but I would also expect that to be the most
common setup.
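
For reference, this is the relevant knob in cassandra.yaml, shown here with
what I believe is the shipped default (unless it has been overridden in your
cluster):

    # cassandra.yaml
    max_hint_window_in_ms: 10800000  # 3 hours

Any replacement that takes longer than this window falls under the note above.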

--
Alex
