On Thu, Jun 13, 2019 at 3:16 PM Jeff Jirsa <jji...@gmail.com> wrote:

> On Jun 13, 2019, at 2:52 AM, Oleksandr Shulgin
> <oleksandr.shul...@zalando.de> wrote:
>
>> On Wed, Jun 12, 2019 at 4:02 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> To avoid violating consistency guarantees, you have to repair the
>>> replicas while the lost node is down
>>
>> How do you suggest triggering it? Potentially replicas of the primary
>> range for the down node are all over the local DC, so I would go with
>> triggering a full cluster repair with Cassandra Reaper. But isn't it
>> going to fail because of the down node?
>
> I'm not sure there's an easy and obvious path here - this is something
> TLP may want to enhance Reaper to help with.
>
> You have to specify the ranges with -st/-et, and you have to tell it to
> ignore the down host with -hosts. With vnodes you're right that this may
> be lots and lots of ranges all over the ring.
>
> There's a patch proposed (maybe committed in 4.0) that makes this a
> non-issue by allowing bootstrap to stream one repaired set and all of the
> unrepaired replica data (which is probably very small if you're running
> IR regularly), which accomplishes the same thing.
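If I understand the suggestion correctly, a single such subrange repair
would look roughly like the command below (tokens, addresses and the
keyspace name are placeholders, not from our setup), run from one of the
surviving replicas:

    # Sketch only: repair one token range of the down node's data,
    # listing the live replicas explicitly with -hosts so the dead node
    # is not contacted. The node you run this on must be in the list.
    nodetool repair -full \
        -st <range_start_token> -et <range_end_token> \
        -hosts <this_live_replica> \
        -hosts <other_live_replica_1> \
        -hosts <other_live_replica_2> \
        <keyspace>

And with vnodes this would have to be repeated for every range the dead
node was a replica for.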
Ouch, it really hurts to learn this. :(

>> It is also documented (I believe) that one should repair the node after
>> it finishes the "replace address" procedure. So should one repair
>> before and after?
>
> You do not need to repair after the bootstrap if you repair before. If
> the docs say that, they're wrong. The joining host gets writes during
> bootstrap and consistency levels are altered during bootstrap to account
> for the joining host.

This is what I had in mind (what makes replacement different from actual
bootstrap of a new node):
http://cassandra.apache.org/doc/latest/operating/topo_changes.html?highlight=replace%20address#replacing-a-dead-node

  Note
  If any of the following cases apply, you MUST run repair to make the
  replaced node consistent again, since it missed ongoing writes
  during/prior to bootstrapping. The *replacement* timeframe refers to
  the period from when the node initially dies to when a new node
  completes the replacement process.

  1. The node is down for longer than max_hint_window_in_ms before being
     replaced.
  2. You are replacing using the same IP address as the dead node and
     replacement takes longer than max_hint_window_in_ms.

I would imagine that any production-size instance would take way longer
to replace than the default max hint window (which is 3 hours, AFAIK). I
didn't remember the same-IP restriction, but I would expect that to be
the most common setup anyway.

--
Alex
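P.S. For reference, the hint window here is the max_hint_window_in_ms
setting in cassandra.yaml (default 10800000 ms, i.e. those 3 hours), and
the replacement itself is kicked off by starting the new node with a JVM
flag roughly like this (the address is a placeholder):

    # cassandra-env.sh on the replacement node, before its first start
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead_node_ip>"

So unless that whole bootstrap finishes within the hint window, the
repair requirement from the docs above applies.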