any name works for me, Jaydeep :-)

I've taken a pass through the CEP, design doc, and current PR.  Below are
my four (rough categories of) questions.
I am keen to see an MVP land, so I'm looking more at what the CEP's design
might not be able to do, rather than what may or may not land in an initial
implementation.  There's a fair bit below, and some of it would sit better
in the PR; feel free to take it there if that's more constructive.


1) The need for different schedules for different tables
2) Failure modes: a failing repair thrashing repairs for all
keyspaces+tables
3) Small concerns on relying on system tables
4) Small concerns on tuning requirements


(1)
Alex also touched on this.  I'm aware of too many cases where this is a
must-have.  Many users cannot repair their clusters without tuning
per-table schedules.  Different gc_grace_seconds values are the biggest
reason.  But there's also running full repairs infrequently, for disk rot
(or similar reasons), on a table that is otherwise frequently incrementally
repaired (which also means an incremental repair could be skipped while a
full repair is currently running).  Or TWCS tables, where you benefit from
a higher frequency of incremental repairs (and/or want to minimise repairs
of data older than the current time_window).  You may also want to run
repairs differently in different DCs.
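
To make this concrete, here's a rough sketch of the kind of per-table knobs
I mean.  This is purely illustrative, not the CEP's API; all the names and
values below are made up:

    import java.time.Duration;
    import java.util.List;
    import java.util.Map;

    // Purely illustrative: these names are hypothetical, not the CEP's API.
    // The point is that interval, repair type, and DC scope all need to be
    // expressible per table, not only per node or per cluster.
    public class PerTableScheduleExample
    {
        enum RepairType { FULL, INCREMENTAL }

        // interval: how often to repair; type: full vs incremental;
        // datacenters: which DCs the schedule applies to (empty = all).
        record Schedule(Duration interval, RepairType type, List<String> datacenters) {}

        public static void main(String[] args)
        {
            Map<String, List<Schedule>> schedules = Map.of(
                // short gc_grace_seconds => must complete well inside that window
                "ks1.low_gcgs_table",
                List.of(new Schedule(Duration.ofDays(1), RepairType.INCREMENTAL, List.of())),
                // TWCS table: frequent incremental, plus an infrequent full repair for disk rot
                "ks1.twcs_table",
                List.of(new Schedule(Duration.ofHours(6), RepairType.INCREMENTAL, List.of()),
                        new Schedule(Duration.ofDays(30), RepairType.FULL, List.of())),
                // a table repaired only in one DC, on its own cadence
                "ks2.dc_scoped_table",
                List.of(new Schedule(Duration.ofDays(7), RepairType.FULL, List.of("dc1"))));

            schedules.forEach((table, s) -> System.out.println(table + " -> " + s));
        }
    }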

(2)
I'm curious as to how crashed repairs are handled and resumed…
A problem a lot of users struggle with is when the repair on one table is
enigmatically problematic, crashing or timing out, and it takes a long time
to figure out.
Without any per-table scheduling and history (IIUC), a node would have to
restart repairs for all keyspaces and tables.  This will lead to
over-repairing some tables and never repairing others.

And without such per-table tracking, I'm also kinda curious as to how we
interact with manual repair invocations the user makes.
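
For illustration only (this is not the CEP's implementation, and all names
here are made up): per-table history along these lines would let a
restarted node skip tables that were repaired recently, and would give
manual repairs a place to be recorded too:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: per-table repair history, so a crashed/restarted run
    // only redoes the tables that are actually due, and manual repairs can be
    // recorded so the scheduler does not immediately repeat them.
    public class PerTableRepairHistory
    {
        private final Map<String, Instant> lastSuccessfulRepair = new ConcurrentHashMap<>();

        // Called when any repair (scheduled or manual) completes successfully.
        void recordSuccess(String table, Instant completedAt)
        {
            lastSuccessfulRepair.merge(table, completedAt,
                                       (prev, latest) -> latest.isAfter(prev) ? latest : prev);
        }

        // On (re)start, only tables whose last success is older than their
        // interval are due; everything else is skipped rather than re-repaired.
        boolean isDue(String table, Duration interval, Instant now)
        {
            Instant last = lastSuccessfulRepair.get(table);
            return last == null || last.plus(interval).isBefore(now);
        }

        public static void main(String[] args)
        {
            PerTableRepairHistory history = new PerTableRepairHistory();
            Instant now = Instant.now();
            history.recordSuccess("ks1.t1", now.minus(Duration.ofHours(2)));
            // ks1.t1 was repaired 2h ago: not due on a 24h schedule, so a restart skips it.
            System.out.println("ks1.t1 due? " + history.isDue("ks1.t1", Duration.ofDays(1), now));
            // ks1.t2 has no recorded repair at all: due.
            System.out.println("ks1.t2 due? " + history.isDue("ks1.t2", Duration.ofDays(1), now));
        }
    }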

There are operational situations that require manual repairs, e.g. node
replacement or a node that has been down for too long, with consistency
breakages until such a repair completes.  Leaving such operational
requirements to this CEP's in-built scheduler is a limited approach: it may
be many days before the scheduler gets to them, and even with node priority,
will it appropriately switch from primary-range to all-replica-ranges?

What if the user accidentally invokes an incremental repair when the
in-built scheduler is expecting only ever to perform full repairs; does it
know how to detect and remedy that?


(3)
Having stuff in system tables is brittle and adds write amplification; we
have plenty of experience of this from DSE NodeSync and Reaper.  Reaper's
ability to store its metadata out-of-cluster is a huge benefit.  Having
read the design doc and PR, I am impressed by how lightweight the design of
the tables is.  But I do still think we deserve some numbers, and a further
line of questioning: what consistency guarantees do we need, how does this
work cross-DC and during topology changes, and does an event that
introduces data-at-rest inconsistencies in the cluster then leave the
repair mechanism confused/inefficient because its own metadata is now also
inconsistent?  For the most part these are problems not unique to any table
in system_distributed, and otherwise handled, but how does the
system_distributed keyspace's handling of such failures impact repairs?

Even with strong consistency, I would assume the design needs to be
pessimistic, e.g. multiple node repairs can be started at the same time.
Is this true, and if so how is it handled?
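
As an illustration of the kind of pessimistic coordination I'm asking about
(not the CEP's mechanism; the keyspace/table names are made up, and it uses
the Java driver purely for the example), a TTL'd LWT lease is one way to
stop two nodes from starting a repair on the same table at the same time:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.ResultSet;

    // Hypothetical sketch: a lightweight-transaction "lease" row with a TTL.
    // The TTL releases the lease automatically if the holder crashes mid-repair.
    public class RepairLeaseExample
    {
        public static void main(String[] args)
        {
            try (CqlSession session = CqlSession.builder().build())
            {
                session.execute("CREATE KEYSPACE IF NOT EXISTS repair_demo WITH replication = "
                                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
                session.execute("CREATE TABLE IF NOT EXISTS repair_demo.repair_lease "
                                + "(table_name text PRIMARY KEY, holder text)");

                // Only one node's INSERT wins the Paxos round; everyone else backs off.
                ResultSet rs = session.execute(
                    "INSERT INTO repair_demo.repair_lease (table_name, holder) "
                    + "VALUES ('ks1.t1', 'node-a') IF NOT EXISTS USING TTL 3600");

                if (rs.wasApplied())
                    System.out.println("Lease acquired: safe to start repairing ks1.t1");
                else
                    System.out.println("Another node holds the lease; back off and retry later");
            }
        }
    }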

I am also curious as to how the impact of these tables changes as we
address (1) and (2).

(4)
I can see how the CEP's design works well for the biggest clusters, and
those with heterogeneous data models (which often come with larger
deployments).  But I don't think we can use this as the bar for quality
or acceptance.  Many smaller clusters with lots of keyspaces and tables
have real trouble getting repairs to run weekly.  We can't simply blame
users for not having optimal data models and deployments.

Carefully tuning the schedules of tables, and the cluster itself, is often
a requirement, and it is time-consuming and a real pain point.  I can say
with confidence that the CEP as it stands today will simply not work for
many users.  Worse than that, it will provide false hope, costing users
time and effort before they realise it won't work and have to revert to
their previous solution.  No one expects the CEP to initially handle and
solve every situation, especially poor data models and over-capacity
clusters.  The hope here is just a bit of discussion that can help us be
informative about our limitations, and possibly save some users from
thinking this is their silver bullet.

The biggest aspect to this I believe is (1), but operational stability and
tuning are also critical.  Alex mentions the range-centric approach, which
helps balance load, which in turn gives you more headroom.  But there's
also stuff like parallel repairs, handling (dividing) repair_session_size,
256 vnodes times 16 subranges on many empty tables saturating throughput,
etc.  I think most of these are minor and will fit into the design ok.
WRT 256 vnodes multiplied by 16 subranges running on tiny tables, I see
implementing the splitting of repairs by partition count rather than by
token range as pretty crucial tbh (sketched below).  Also curious (maybe I
missed it in the PR) how incremental repairs are being token-ranged, as
this has a noticeable duplicating impact on the cost of anti-compaction.
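
To illustrate what I mean by splitting by partition count (a sketch only,
not the PR's implementation; the names are made up, and a real version
would only merge subranges that share the same replica set):

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: group subranges by estimated partition count instead
    // of issuing a fixed number of subranges per vnode.  On a tiny table,
    // 256 vnodes x 16 subranges would otherwise mean 4096 near-empty repair
    // sessions; grouping by partitions collapses that to a handful.
    public class PartitionCountSplitter
    {
        // A contiguous token subrange with an estimated partition count.
        record SubRange(long startToken, long endToken, long estimatedPartitions) {}

        // One repair session covering a run of adjacent subranges.
        record Session(List<SubRange> ranges, long estimatedPartitions) {}

        static List<Session> group(List<SubRange> subRanges, long targetPartitionsPerSession)
        {
            List<Session> sessions = new ArrayList<>();
            List<SubRange> current = new ArrayList<>();
            long partitions = 0;

            for (SubRange r : subRanges)
            {
                current.add(r);
                partitions += r.estimatedPartitions();
                if (partitions >= targetPartitionsPerSession)
                {
                    sessions.add(new Session(List.copyOf(current), partitions));
                    current.clear();
                    partitions = 0;
                }
            }
            if (!current.isEmpty())
                sessions.add(new Session(List.copyOf(current), partitions));
            return sessions;
        }

        public static void main(String[] args)
        {
            // 4096 nearly-empty subranges (256 vnodes x 16 splits), ~10 partitions each.
            List<SubRange> tinyTable = new ArrayList<>();
            for (int i = 0; i < 4096; i++)
                tinyTable.add(new SubRange(i, i + 1, 10));

            List<Session> sessions = group(tinyTable, 50_000);
            System.out.println("Sessions for tiny table: " + sessions.size()); // 1 instead of 4096
        }
    }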

Also, is node load ever taken into account, e.g. avoiding starting repairs
on nodes with too many pending compactions, hints still being received, etc.?
