Thanks, Mick, for the comments; please find my responses below.

> (1)

I think I covered most of the points in my response to Alexander (except one, which I respond to separately below). Tl;dr: the MVP can be easily extended to do a table-level schedule; it would just be another CQL table property instead of the yaml config used in the current MVP. I have already added this as a near-term feature and noted that when we add table-wise repair priority, we need to ensure table-level scheduling is also taken care of. Please see my latest few comments on the ticket: https://issues.apache.org/jira/browse/CASSANDRA-20013
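For illustration only, a table-level schedule expressed as a CQL property could look something like the sketch below. The property name and sub-options are hypothetical (nothing here is final syntax from the MVP or the ticket); the point is simply that the current yaml-driven schedule can move to table granularity without architectural changes.

    -- Hypothetical sketch only: the property name and sub-options are made up
    -- to illustrate a table-level schedule as a CQL table property.
    ALTER TABLE ks.tbl WITH automated_repair = {
        'full_repair_interval'        : '7d',  -- e.g. infrequent FR for disk rot
        'incremental_repair_interval' : '1d',  -- more frequent IR
        'priority'                    : 'high' -- ties into future table-wise priority
    };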
> You may also want to do repairs in different DCs differently.

Currently, the MVP allows one to skip one or more DCs if they wish to do so; by default, all DCs are included. This again points to the same theme of allowing a schedule (or priority) at the table level, followed by the DC level. The MVP can be easily extended to whatever granularity we want scheduling to be, without many architectural changes; we all just have to finalize the granularity we want. I have also added to the ticket above that scheduling support should come at table-level granularity first, followed by DC-level.

> I'm curious as to how crashed repairs are handled and resumed

The MVP has a maximum allowed quota (timeout) at the keyspace level and at the table level. So, if a table and/or keyspace takes much longer than its timeout, due to failures, more data to repair, etc., the scheduler will skip to the next table/keyspace.

> Without any per-table scheduling and history (IIUC) a node would have to restart the repairs for all keyspaces and tables.

The above-mentioned quota should work fine and will make sure the bad tables/keyspaces are skipped, allowing the good keyspaces/tables to proceed on a node, as long as the Cassandra JVM itself is not crashing. If the JVM keeps crashing, then repair will restart all over again, but a JVM that keeps crashing is a more significant issue in itself and does not happen regularly, IMO.
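To make the quota idea concrete, here is a minimal yaml sketch of what such limits might look like. The option names are hypothetical and may not match the actual keys in the MVP's configuration.

    # Hypothetical option names, shown only to illustrate the per-keyspace and
    # per-table quota (timeout); the MVP's real yaml keys may differ.
    auto_repair:
      max_duration_per_keyspace: 12h   # move on to the next keyspace beyond this
      max_duration_per_table: 2h       # move on to the next table beyond this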
> And without such per-table tracking, I'm also kinda curious as to how we interact with manual repair invocations the user makes. There are operational requirements to do manual repairs, e.g. node replacement or if a node has been down for too long, and consistency breakages until such repair is complete. Leaving such operational requirements to this CEP's in-built scheduler is a limited approach, it may be many days before it gets to doing it, and even with node priority will it appropriately switch from primary-range to all-replica-ranges?

To alleviate some of this, the MVP has two options one can configure dynamically through *nodetool*: 1) setting a priority for nodes, and 2) telling the scheduler to repair one or more nodes immediately. If an admin puts some nodes in the priority queue, those nodes will be repaired ahead of the scheduler's own list. If an admin tags some nodes on the emergency list, those nodes will be repaired immediately. Basically, the admin tells the scheduler, "*Just do what I say instead of using your list of nodes*."

Even with this, if an admin decides to trigger a repair manually through *nodetool repair*, the scheduler should not interfere with that manually triggered operation - they can progress independently. The MVP has options to disable the scheduler's repairs dynamically, without any cluster restart, etc., so the admin can use some combination of these and decide what to do when they invoke any manual repair operation.

> What if the user accidentally invokes an incremental repair when the in-built scheduler is expecting only to ever perform full repairs? Does it know how to detect/remedy that?

The user's invocation and the scheduler's invocations go through two different repair sessions. If the MVP scheduler has been configured to perform only FR, the scheduler will never fire IR, but that does not prohibit the user from firing IR through *nodetool repair*. As a future enhancement to the MVP, we should warn the user that it might not be safe to run IR when the in-built scheduler has been configured not to do IR.

> Having read the design doc and PR, I am impressed how lightweight the design of the tables are.

Thanks. To reiterate, the number of records in system_distributed will be equal to the number of nodes in the cluster.

> But I do still think we deserve some numbers, and a further line of questioning: what consistency guarantees do we need, how does this work cross-dc, during topology changes, does an event that introduces data-at-rest inconsistencies in the cluster then become confused/inefficient when the mechanism to repair it also now has its metadata inconsistent. For the most part this is a problem not unique to any table in system_distributed and otherwise handled, but how does the system_distributed keyspace handling of such failures impact repairs.

Keeping practicality in mind, the record count in the table should be as small as three rows and as big as a couple of thousand, considering production clusters worldwide, IMO. Each node is responsible for updating only its own record, i.e., each node touches at most one record, and it uses LWT (LOCAL_QUORUM + LOCAL_SERIAL) to update the data to preserve consistency. This means that, by and large, the metadata would remain consistent in the majority of cases.
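As a hedged sketch of what that single-row LWT update looks like (the table and column names below are illustrative and do not necessarily match the actual schema in the PR):

    -- Illustrative only: hypothetical table/column names; the point is the
    -- conditional single-row update each node performs for its own row.
    -- Executed with consistency LOCAL_QUORUM and serial consistency LOCAL_SERIAL.
    UPDATE system_distributed.auto_repair_status
       SET status = 'RUNNING', updated_at = toTimestamp(now())
     WHERE host_id = ?              -- each node only ever touches its own row
        IF status != 'RUNNING';     -- LWT condition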
Let's assume, in the worst case, that the data becomes inconsistent between two nodes, which means these two nodes have a split view. In this case, the scheduler might trigger repair on more nodes simultaneously than the configured parallelism allows. But this worst case is possible even with the data stored outside of *system_distributed*, unless we choose to store the repair metadata outside Cassandra, say, in etcd. I believe CEP-15 should strengthen our metadata story further, and we can tweak this once CEP-15 is certified. In summary, I personally do not believe this is a massive problem, as it does not break anything, and with CEP-15 on the horizon, the repair metadata story will become even stronger. We should not store this metadata outside of Cassandra, because it gives a different impression to others about the Apache Cassandra project :) - we should rather solidify it eventually on top of CEP-15 or something similar.

> I am also curious as to how the impact of these tables changes as we address (1) and (2).

Quite a lot of (1) & (2) can be addressed by just adding a new CQL property, which won't even touch these metadata tables. If we do need to, depending on the design for (1) & (2), it can be addressed by adding new columns and/or a new metadata table.

> I can see how the CEP's design works well for the biggest clusters, and those with heterogeneous data-models (which often comes with larger deployment sets). But I don't think we can use this as the bar to quality or acceptance. Many smaller clusters that come with lots of keyspaces and tables have real troubles trying to get repairs to run weekly. We can't simply blame users for not having optimal data models and deployments.

The MVP implementation/design should work for most cases: small, medium, and enormous clusters. As I mentioned to Alexander, scheduling is available; the default is "1d", but the config can be changed to "7d", and the cluster will then be scheduled for repair only once every 7d. One can also have different schedules for IR and FR, say IR every 1d and FR every 7d. My personal recommendation is to keep running repair continuously, because if we delay repair, the amount of inconsistency amplifies and we pay a hefty price later. But the MVP does have the option if one wants to run it every 7d or so. I am even fine with increasing the default from 1d to 7d. Just sharing another personal experience: we run clusters of all different size combinations in production, and the same design has been working flawlessly so far. 🤞
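A minimal yaml sketch of that kind of schedule, assuming hypothetical option names, just to show separate IR and FR cadences:

    # Hypothetical option names, only to illustrate separate IR/FR schedules;
    # the MVP's actual yaml keys may differ.
    auto_repair:
      incremental:
        enabled: true
        min_repair_interval: 1d   # IR roughly daily
      full:
        enabled: true
        min_repair_interval: 7d   # FR roughly weekly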
> No one expects the CEP to initially handle and solve every situation, especially poor data-models and over-capacity clusters. Hope here is just a bit of discussion that can help us be informative about our limitations, and possibly save some users from thinking this is their silver bullet.

I totally agree with you that it is almost impossible to come up with a solution and default configuration that will satisfy 100% of the use cases across the world, but the idea here is to start with the most common ones and, as we learn, keep building on top of the MVP, e.g. table-level priority, etc., to make it a richer solution that increases the coverage. That's why, if we look at the approach, it has been incremental so far: a) it has been validated with the production use cases I have; b) then Chris L started incorporating it and, along the way, suggested some tweaks, changes, etc., and now it works for most cases on his side; c) in the same way, we will continue engaging more users, keep tweaking, and eventually all of us will make this an even richer solution.

> The biggest aspect to this I believe is (1), but operational stability and tuning is also critical. Alex mentions the range-centric approach, which helps balance load, which in turn gives you more head room. But there's also stuff like parallel repairs, handling (dividing) repair_session_size, 256 vnode times 16 subranges on many empty tables saturating throughput, etc. I think most of these are minor and will fit into the design ok. WRT 256 vnode multiplied by 16 subranges running on tiny tables, I see the implementation of splitting repairs by partition count rather than token range as pretty crucial tbh.

This has already been answered as part of Alexander's response and above.

> Also curious (maybe I missed it in the PR) how incremental repairs are getting token ranged, as this has a noticeable duplicating impact on the cost of anti-compaction.

The IR currently splits the token ranges evenly - this approach is OK for FR, but it is not optimal for IR. As I mentioned above, Andy T and Chris L have a working POC on top of the MVP that calculates the token ranges based on unrepaired data size; they are working on adding test cases, etc., and then the MVP will be extended with that solution for IR.

> Also, is node load ever taken into account, e.g. avoiding starting repairs on nodes where too many pending compactions, hints being received, etc.

Currently, it does not, but this is altogether a different scope of work in Cassandra: we should assign a priority to every work item, say Compaction, Streaming, Repair, Memtable flush, Snapshot, etc., then monitor resource utilization and honor only the top-priority tasks when under a resource crunch. I have proposed a glimpse of this in CEP-41 <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-41+%28DRAFT%29+Apache+Cassandra+Unified+Rate+Limiter> (Draft); a lot more is to be done here. Alex Petrov mentioned to me that he is also thinking along these lines and will initiate a detailed discussion on this topic soon.

Jaydeep

On Mon, Oct 28, 2024 at 9:21 AM Mick Semb Wever <m...@apache.org> wrote:

> any name works for me, Jaydeep :-)
>
> I've taken a run through of the CEP, design doc, and current PR. Below
> are my four (rough categories of) questions.
> I am keen to see a MVP land, so I'm more looking at what the CEP's design
> might not be able to do, rather than what may or may not land in an initial
> implementation. There's a bit below, and some of it really would be better
> in the PR, feel free to take it there if deemed more constructive.
>
>
> 1) The need for different schedules for different tables
> 2) Failure mode: repairs failing and thrashing repairs for all
> keyspaces+tables
> 3) Small concerns on relying on system tables
> 4) Small concerns on tuning requirements
>
>
> (1)
> Alex also touched on this. I'm aware of too many reasons where this is a
> must-have. Many users cannot repair their clusters without tuning
> per-table schedules. Different gc_grace_seconds is the biggest reason.
> But there's also running full repairs infrequently for disk rot (or similar
> reason) on a table that's otherwise frequently incremental repaired (also
> means an incremental repair could be skipped if the full repair was
> currently running). Or TWCS tables where you benefit from higher frequency
> of incremental repair (and/or want to minimise repairs older than the
> current time_window). You may also want to do repairs in different DCs
> differently.
>
> (2)
> I'm curious as to how crashed repairs are handled and resumed…
> A problem a lot of users struggle with is where the repair on one table is
> enigmatically problematic, crashing or timing out, and it takes a long time
> to figure it out.
> Without any per-table scheduling and history (IIUC) a node would have to
> restart the repairs for all keyspaces and tables. This will lead to
> over-repairing some tables and never repairing others.
>
> And without such per-table tracking, I'm also kinda curious as to how we
> interact with manual repair invocations the user makes.
>
> There are operational requirements to do manual repairs, e.g. node
> replacement or if a node has been down for too long, and consistency
> breakages until such repair is complete. Leaving such operational
> requirements to this CEP's in-built scheduler is a limited approach, it may
> be many days before it gets to doing it, and even with node priority will
> it appropriately switch from primary-range to all-replica-ranges?
>
> What if the user accidently invokes an incremental repair when the
> in-built scheduler is expecting only to ever perform full repairs, does it
> know how to detect/remedy that?
>
>
> (3)
> Having stuff in system tables is brittle and a write-amplification, we
> have plenty of experience of this from DSE NodeSync and Reaper. Reaper's
> ability to store its metadata out-of-cluster is a huge benefit. Having
> read the design doc and PR, I am impressed how lightweight the design of
> the tables are. But I do still think we deserve some numbers, and a
> further line of questioning: what consistency guarantees do we need, how
> does this work cross-dc, during topology changes, does an event that
> introduces data-at-rest inconsistencies in the cluster then become
> confused/inefficient when the mechanism to repair it also now has its
> metadata inconsistent. For the most part this is a problem not unique to
> any table in system_distributed and otherwise handled, but how does the
> system_distributed keyspace handling of such failures impact repairs.
>
> Even with strong consistency, I would assume the design needs to be
> pessimistic, e.g. multiple node repairs can be started at the time. Is
> this true, if so how is it handled ?
>
> I am also curious as to how the impact of these tables changes as we
> address (1) and (2).
>
> (4)
> I can see how the CEP's design works well for the biggest clusters, and
> those with heterogeneous data-models (which often comes with larger
> deployment sets). But I don't think we can use this as the bar to quality
> or acceptance. Many smaller clusters that come with lots of keyspaces and
> tables have real troubles trying to get repairs to run weekly. We can't
> simply blame users for not having optimal data models and deployments.
>
> Carefully tuning the schedules of tables, and the cluster itself, is often
> a requirement – time-consuming and a real pain point. The CEP as it stands
> today I can, with confidence, say will simply not work for many users.
> Worse than that it will provide false hope, and take time and effort for
> users until they realise it won't work, leaving them having to revert to
> their previous solution. No one expects the CEP to initially handle and
> solve every situation, especially poor data-models and over-capacity
> clusters. Hope here is just a bit of discussion that can help us be
> informative about our limitations, and possibly save some users from
> thinking this is their silver bullet.
>
> The biggest aspect to this I believe is (1), but operational stability and
> tuning is also critical. Alex mentions the range-centric approach, which
> helps balance load, which in turn gives you more head room. But there's
> also stuff like parallel repairs, handling (dividing) repair_session_size,
> 256 vnode times 16 subranges on many empty tables saturating throughput,
> etc. I think most of these are minor and will fit into the design ok.
> WRT 256 vnode multiplied by 16 subranges running on tiny tables, I see the
> implementation of splitting repairs by partition count rather than token
> range as pretty crucial tbh. Also curious (maybe I missed it in the PR)
> how incremental repairs are getting token ranged, as this has a noticeable
> duplicating impact on the cost of anti-compaction.
>
> Also, is node load ever taken into account, e.g. avoiding starting
> repairs on nodes where too many pending compactions, hints being received,
> etc.
>
>