any name works for me, Jaydeep :-) I've taken a run through the CEP, design doc, and current PR. Below are my four (rough categories of) questions. I am keen to see an MVP land, so I'm looking more at what the CEP's design might not be able to do than at what may or may not land in an initial implementation. There's a bit below, and some of it would really sit better in the PR; feel free to take it there if that's more constructive.
1) The need for different schedules for different tables
2) Failure mode: repairs failing and thrashing repairs for all keyspaces+tables
3) Small concerns on relying on system tables
4) Small concerns on tuning requirements

(1) Alex also touched on this. I'm aware of too many situations where this is a must-have: many users cannot repair their clusters without tuning per-table schedules. Differing gc_grace_seconds is the biggest reason. But there's also running full repairs infrequently for disk rot (or similar reasons) on a table that's otherwise frequently incrementally repaired (which also means an incremental repair could be skipped if a full repair is currently running). Or TWCS tables, where you benefit from a higher frequency of incremental repair (and/or want to minimise repairs older than the current time_window). You may also want to run repairs differently in different DCs. (A rough sketch of the gc_grace_seconds constraint is below, after (3).)

(2) I'm curious as to how crashed repairs are handled and resumed… A problem a lot of users struggle with is when the repair on one table is enigmatically problematic, crashing or timing out, and it takes a long time to figure out. Without any per-table scheduling and history (IIUC), a node would have to restart the repairs for all keyspaces and tables. This will lead to over-repairing some tables and never repairing others. And without such per-table tracking, I'm also kinda curious as to how we interact with manual repair invocations the user makes. There are operational requirements to do manual repairs, e.g. after a node replacement or when a node has been down too long, with consistency breakages until such a repair completes. Leaving such operational requirements to this CEP's in-built scheduler is a limited approach: it may be many days before it gets around to them, and even with node priority, will it appropriately switch from primary-range to all-replica-ranges? What if the user accidentally invokes an incremental repair when the in-built scheduler is expecting only ever to perform full repairs? Does it know how to detect and remedy that?

(3) Having stuff in system tables is brittle and a source of write amplification; we have plenty of experience of this from DSE NodeSync and Reaper. Reaper's ability to store its metadata out-of-cluster is a huge benefit. Having read the design doc and PR, I am impressed by how lightweight the design of the tables is. But I do still think we deserve some numbers, and a further line of questioning: what consistency guarantees do we need? How does this work cross-DC and during topology changes? Does an event that introduces data-at-rest inconsistencies in the cluster then confuse, or make inefficient, the repair mechanism whose own metadata is now also inconsistent? For the most part this is a problem not unique to any table in system_distributed, and otherwise handled, but how does the way the system_distributed keyspace handles such failures impact repairs? Even with strong consistency, I would assume the design needs to be pessimistic, e.g. repairs can be started on multiple nodes at the same time. Is this true, and if so how is it handled? I am also curious how the impact of these tables changes as we address (1) and (2).
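Going back to (1) for a second, here's a rough sketch of the gc_grace_seconds constraint I'm referring to. All names and numbers below are made up for illustration and none of it comes from the CEP or PR; it's only to show why one cluster-wide cadence can't fit tables with very different grace periods:

    # Rule of thumb: every replica of a table must be repaired well within
    # gc_grace_seconds, otherwise tombstones can be purged before they have
    # propagated and deleted data can resurrect.
    # Table names, durations and the 0.5 safety factor are illustrative only.

    DAY = 86400

    tables = {
        # table           (gc_grace_seconds, typical repair duration in s)
        "ks1.small_ttl": (1 * DAY,  2 * 3600),
        "ks1.default":   (10 * DAY, 12 * 3600),
        "ks2.archive":   (30 * DAY, 48 * 3600),
    }

    def max_repair_interval(gc_grace, repair_duration, safety=0.5):
        """Latest acceptable gap between repairs of the same range (seconds)."""
        return (gc_grace - repair_duration) * safety

    for name, (gc_grace, duration) in tables.items():
        print(f"{name}: repair at least every "
              f"{max_repair_interval(gc_grace, duration) / 3600:.0f}h")
    # ks1.small_ttl: 11h, ks1.default: 114h, ks2.archive: 336h

A schedule tight enough for ks1.small_ttl grossly over-repairs ks2.archive, and one relaxed enough for ks2.archive silently breaks the tombstone guarantee on ks1.small_ttl. Hence per-table schedules.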
(4) I can see how the CEP's design works well for the biggest clusters, and for those with heterogeneous data models (which often come with larger deployment sets). But I don't think we can use this as the bar to quality or acceptance. Many smaller clusters that come with lots of keyspaces and tables have real trouble getting repairs to run weekly. We can't simply blame users for not having optimal data models and deployments. Carefully tuning the schedules of tables, and of the cluster itself, is often a requirement – time-consuming and a real pain point. The CEP as it stands today will, I can say with confidence, simply not work for many users. Worse than that, it will provide false hope, and cost users time and effort until they realise it won't work, leaving them to revert to their previous solution. No one expects the CEP to initially handle and solve every situation, especially poor data models and over-capacity clusters. The hope here is just that a bit of discussion can help us be informative about our limitations, and possibly save some users from thinking this is their silver bullet.

The biggest aspect of this I believe is (1), but operational stability and tuning are also critical. Alex mentions the range-centric approach, which helps balance load, which in turn gives you more head room. But there's also stuff like parallel repairs, handling (dividing) repair_session_size, 256 vnodes times 16 subranges on many empty tables saturating throughput, etc. I think most of these are minor and will fit into the design ok. WRT 256 vnodes multiplied by 16 subranges running on tiny tables, I see the implementation of splitting repairs by partition count rather than token range as pretty crucial tbh (back-of-envelope below). Also curious (maybe I missed it in the PR) how incremental repairs are getting token-ranged, as this has a noticeable duplicating impact on the cost of anti-compaction. And is node load ever taken into account, e.g. avoiding starting repairs on nodes with too many pending compactions, hints being received, etc.?
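To put a (very hedged) number on the vnode-times-subrange point: only the 256 and 16 come from above; the table count and the one-session-per-subrange-per-table assumption are mine and may not match what the PR actually does.

    # Back-of-envelope for how many repair sessions a naive token-range split
    # produces per full pass over one node. The 256 vnodes and 16 subranges
    # are from the discussion above; the table count and the assumption of
    # one session per subrange per table are illustrative only.

    vnodes_per_node     = 256
    subranges_per_vnode = 16
    tables              = 50   # assumed; plenty of clusters have far more

    sessions_per_node = vnodes_per_node * subranges_per_vnode * tables
    print(sessions_per_node)   # 204800 sessions per node per pass

    # If most of those tables are tiny or empty, nearly all of that is
    # per-session overhead (validation/Merkle trees, session coordination,
    # any metadata writes) rather than data actually being repaired. Splitting
    # by partition/row count instead of token range lets a small table
    # collapse to a handful of sessions instead of 4096 per node.

Even at one second of overhead per session that's roughly 57 hours of wall clock per node per pass under these assumptions, dominated by overhead rather than by repaired data, which is why the partition-count splitting feels crucial to me.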