Thanks, Mick, for the comments; please find my responses below.

> (1)

I think I covered most of the points in my response to Alexander (except one, which I respond to separately below). Tl;dr: the MVP can be easily extended to do a table-level schedule; it would just be another CQL table property instead of the yaml config used in the current MVP. I have already added this as a near-term feature and noted that when we add table-wise repair priority, we need to ensure table-level scheduling is also taken care of. Please see my latest few comments on the ticket: https://issues.apache.org/jira/browse/CASSANDRA-20013
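For illustration only, a table-level schedule expressed as a CQL property could look something like the sketch below. The property name and sub-options are hypothetical (nothing here is final syntax from the MVP or the ticket); the point is simply that the current yaml-driven schedule can move to table granularity without architectural changes.

    -- Hypothetical sketch only: the property name and sub-options are made up
    -- to illustrate a table-level schedule as a CQL table property.
    ALTER TABLE ks.tbl WITH automated_repair = {
        'full_repair_interval'        : '7d',  -- e.g. infrequent FR for disk rot
        'incremental_repair_interval' : '1d',  -- more frequent IR
        'priority'                    : 'high' -- ties into future table-wise priority
    };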
> You may also want to do repairs in different DCs differently.

Currently, the MVP allows one to skip one or more DCs if they wish to do so; by default, all DCs are included. This again points to the same theme of allowing a schedule (or priority) at the table level, followed by the DC level. The MVP can be easily extended to whatever granularity we want scheduling to be, without many architectural changes; we all just have to finalize the granularity we want. I have also added to the ticket above that scheduling support should come at table-level granularity first, followed by DC-level.

> I'm curious as to how crashed repairs are handled and resumed

The MVP has a maximum allowed quota (timeout) at the keyspace level and at the table level. So, if a table and/or keyspace takes much longer than its timeout, due to failures, more data to repair, etc., the scheduler will skip to the next table/keyspace.

> Without any per-table scheduling and history (IIUC) a node would have to restart the repairs for all keyspaces and tables.

The above-mentioned quota should work fine and will make sure the bad tables/keyspaces are skipped, allowing the good keyspaces/tables to proceed on a node, as long as the Cassandra JVM itself is not crashing. If the JVM keeps crashing, then repair will restart all over again, but a JVM that keeps crashing is a more significant issue in itself and does not happen regularly, IMO.
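To make the quota idea concrete, here is a minimal yaml sketch of what such limits might look like. The option names are hypothetical and may not match the actual keys in the MVP's configuration.

    # Hypothetical option names, shown only to illustrate the per-keyspace and
    # per-table quota (timeout); the MVP's real yaml keys may differ.
    auto_repair:
      max_duration_per_keyspace: 12h   # move on to the next keyspace beyond this
      max_duration_per_table: 2h       # move on to the next table beyond this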
> And without such per-table tracking, I'm also kinda curious as to how we interact with manual repair invocations the user makes. There are operational requirements to do manual repairs, e.g. node replacement or if a node has been down for too long, and consistency breakages until such repair is complete. Leaving such operational requirements to this CEP's in-built scheduler is a limited approach, it may be many days before it gets to doing it, and even with node priority will it appropriately switch from primary-range to all-replica-ranges?

To alleviate some of this, the MVP has two options one can configure dynamically through *nodetool*: 1) setting a priority for nodes, and 2) telling the scheduler to repair one or more nodes immediately. If an admin puts some nodes in the priority queue, those nodes will be repaired ahead of the scheduler's own list. If an admin tags some nodes on the emergency list, those nodes will be repaired immediately. Basically, the admin tells the scheduler, "*Just do what I say instead of using your list of nodes*."

Even with this, if an admin decides to trigger a repair manually through *nodetool repair*, the scheduler should not interfere with that manually triggered operation - they can progress independently. The MVP has options to disable the scheduler's repairs dynamically, without any cluster restart, etc., so the admin can use some combination of these and decide what to do when they invoke any manual repair operation.

> What if the user accidentally invokes an incremental repair when the in-built scheduler is expecting only to ever perform full repairs? Does it know how to detect/remedy that?

The user's invocation and the scheduler's invocations go through two different repair sessions. If the MVP scheduler has been configured to perform only FR, the scheduler will never fire IR, but that does not prohibit the user from firing IR through *nodetool repair*. As a future enhancement to the MVP, we should warn the user that it might not be safe to run IR when the in-built scheduler has been configured not to do IR.

> Having read the design doc and PR, I am impressed how lightweight the design of the tables are.

Thanks. To reiterate, the number of records in system_distributed will be equal to the number of nodes in the cluster.

> But I do still think we deserve some numbers, and a further line of questioning: what consistency guarantees do we need, how does this work cross-dc, during topology changes, does an event that introduces data-at-rest inconsistencies in the cluster then become confused/inefficient when the mechanism to repair it also now has its metadata inconsistent. For the most part this is a problem not unique to any table in system_distributed and otherwise handled, but how does the system_distributed keyspace handling of such failures impact repairs.

Keeping practicality in mind, the record count in the table should be as small as three rows and as big as a couple of thousand, considering production clusters worldwide, IMO. Each node is responsible for updating only its own record, i.e., each node touches at most one record, and it uses LWT (LOCAL_QUORUM + LOCAL_SERIAL) to update the data to preserve consistency. This means that, by and large, the metadata would remain consistent in the majority of cases.
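As a hedged sketch of what that single-row LWT update looks like (the table and column names below are illustrative and do not necessarily match the actual schema in the PR):

    -- Illustrative only: hypothetical table/column names; the point is the
    -- conditional single-row update each node performs for its own row.
    -- Executed with consistency LOCAL_QUORUM and serial consistency LOCAL_SERIAL.
    UPDATE system_distributed.auto_repair_status
       SET status = 'RUNNING', updated_at = toTimestamp(now())
     WHERE host_id = ?              -- each node only ever touches its own row
        IF status != 'RUNNING';     -- LWT condition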
Let's assume, in the worst case, that the data becomes inconsistent between two nodes, which means these two nodes have a split view. In this case, the scheduler might trigger repair on more nodes simultaneously than the configured parallelism allows. But this worst case is possible even with the data stored outside of *system_distributed*, unless we choose to store the repair metadata outside Cassandra, say, in etcd. I believe CEP-15 should strengthen our metadata story further, and we can tweak this once CEP-15 is certified. In summary, I personally do not believe this is a massive problem, as it does not break anything, and with CEP-15 on the horizon, the repair metadata story will become even stronger. We should not store this metadata outside of Cassandra, because it gives a different impression to others about the Apache Cassandra project :) - we should rather solidify it eventually on top of CEP-15 or something similar.

> I am also curious as to how the impact of these tables changes as we address (1) and (2).

Quite a lot of (1) & (2) can be addressed by just adding a new CQL property, which won't even touch these metadata tables. If we do need to, depending on the design for (1) & (2), it can be addressed by adding new columns and/or a new metadata table.

> I can see how the CEP's design works well for the biggest clusters, and those with heterogeneous data-models (which often comes with larger deployment sets). But I don't think we can use this as the bar to quality or acceptance. Many smaller clusters that come with lots of keyspaces and tables have real troubles trying to get repairs to run weekly. We can't simply blame users for not having optimal data models and deployments.

The MVP implementation/design should work for most cases: small, medium, and enormous clusters. As I mentioned to Alexander, scheduling is available; the default is "1d", but the config can be changed to "7d", and the cluster will then be scheduled for repair only once every 7d. One can also have different schedules for IR and FR, say IR every 1d and FR every 7d. My personal recommendation is to keep running repair continuously, because if we delay repair, the amount of inconsistency amplifies and we pay a hefty price later. But the MVP does have the option if one wants to run it every 7d or so. I am even fine with increasing the default from 1d to 7d. Just sharing another personal experience: we run clusters of all different size combinations in production, and the same design has been working flawlessly so far. 🤞
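A minimal yaml sketch of that kind of schedule, assuming hypothetical option names, just to show separate IR and FR cadences:

    # Hypothetical option names, only to illustrate separate IR/FR schedules;
    # the MVP's actual yaml keys may differ.
    auto_repair:
      incremental:
        enabled: true
        min_repair_interval: 1d   # IR roughly daily
      full:
        enabled: true
        min_repair_interval: 7d   # FR roughly weekly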
> No one expects the CEP to initially handle and solve every situation, especially poor data-models and over-capacity clusters. Hope here is just a bit of discussion that can help us be informative about our limitations, and possibly save some users from thinking this is their silver bullet.

I totally agree with you that it is almost impossible to come up with a solution and default configuration that will satisfy 100% of the use cases across the world, but the idea here is to start with the most common ones and, as we learn, keep building on top of the MVP, e.g. table-level priority, etc., to make it a richer solution that increases the coverage. That's why, if we look at the approach, it has been incremental so far: a) it has been validated with the production use cases I have; b) then Chris L started incorporating it and, along the way, suggested some tweaks, changes, etc., and now it works for most cases on his side; c) in the same way, we will continue engaging more users, keep tweaking, and eventually all of us will make this an even richer solution.

> The biggest aspect to this I believe is (1), but operational stability and tuning is also critical. Alex mentions the range-centric approach, which helps balance load, which in turn gives you more head room. But there's also stuff like parallel repairs, handling (dividing) repair_session_size, 256 vnode times 16 subranges on many empty tables saturating throughput, etc. I think most of these are minor and will fit into the design ok. WRT 256 vnode multiplied by 16 subranges running on tiny tables, I see the implementation of splitting repairs by partition count rather than token range as pretty crucial tbh.

This has already been answered as part of Alexander's response and above.

> Also curious (maybe I missed it in the PR) how incremental repairs are getting token ranged, as this has a noticeable duplicating impact on the cost of anti-compaction.

The IR currently splits the token ranges evenly - this approach is OK for FR, but it is not optimal for IR. As I mentioned above, Andy T and Chris L have a working POC on top of the MVP that calculates the token ranges based on unrepaired data size; they are working on adding test cases, etc., and then the MVP will be extended with that solution for IR.

> Also, is node load ever taken into account, e.g. avoiding starting repairs on nodes where too many pending compactions, hints being received, etc.

Currently, it does not, but this is altogether a different scope of work in Cassandra: we should assign a priority to every work item, say Compaction, Streaming, Repair, Memtable flush, Snapshot, etc., then monitor resource utilization and honor only the top-priority tasks when under a resource crunch. I have proposed a glimpse of this in CEP-41 <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-41+%28DRAFT%29+Apache+Cassandra+Unified+Rate+Limiter> (Draft); a lot more is to be done here. Alex Petrov mentioned to me that he is also thinking along these lines and will initiate a detailed discussion on this topic soon.

Jaydeep

On Mon, Oct 28, 2024 at 9:21 AM Mick Semb Wever <m...@apache.org> wrote:

> any name works for me, Jaydeep :-)
>
> I've taken a run through of the CEP, design doc, and current PR. Below
> are my four (rough categories of) questions.
> I am keen to see a MVP land, so I'm more looking at what the CEP's design
> might not be able to do, rather than what may or may not land in an initial
> implementation. There's a bit below, and some of it really would be better
> in the PR, feel free to take it there if deemed more constructive.
>
>
> 1) The need for different schedules for different tables
> 2) Failure mode: repairs failing and thrashing repairs for all
> keyspaces+tables
> 3) Small concerns on relying on system tables
> 4) Small concerns on tuning requirements
>
>
> (1)
> Alex also touched on this. I'm aware of too many reasons where this is a
> must-have. Many users cannot repair their clusters without tuning
> per-table schedules. Different gc_grace_seconds is the biggest reason.
> But there's also running full repairs infrequently for disk rot (or similar
> reason) on a table that's otherwise frequently incremental repaired (also
> means an incremental repair could be skipped if the full repair was
> currently running). Or TWCS tables where you benefit from higher frequency
> of incremental repair (and/or want to minimise repairs older than the
> current time_window). You may also want to do repairs in different DCs
> differently.
>
> (2)
> I'm curious as to how crashed repairs are handled and resumed…
> A problem a lot of users struggle with is where the repair on one table is
> enigmatically problematic, crashing or timing out, and it takes a long time
> to figure it out.
> Without any per-table scheduling and history (IIUC) a node would have to
> restart the repairs for all keyspaces and tables. This will lead to
> over-repairing some tables and never repairing others.
>
> And without such per-table tracking, I'm also kinda curious as to how we
> interact with manual repair invocations the user makes.
>
> There are operational requirements to do manual repairs, e.g. node
> replacement or if a node has been down for too long, and consistency
> breakages until such repair is complete. Leaving such operational
> requirements to this CEP's in-built scheduler is a limited approach, it may
> be many days before it gets to doing it, and even with node priority will
> it appropriately switch from primary-range to all-replica-ranges?
>
> What if the user accidently invokes an incremental repair when the
> in-built scheduler is expecting only to ever perform full repairs, does it
> know how to detect/remedy that?
>
>
> (3)
> Having stuff in system tables is brittle and a write-amplification, we
> have plenty of experience of this from DSE NodeSync and Reaper. Reaper's
> ability to store its metadata out-of-cluster is a huge benefit. Having
> read the design doc and PR, I am impressed how lightweight the design of
> the tables are. But I do still think we deserve some numbers, and a
> further line of questioning: what consistency guarantees do we need, how
> does this work cross-dc, during topology changes, does an event that
> introduces data-at-rest inconsistencies in the cluster then become
> confused/inefficient when the mechanism to repair it also now has its
> metadata inconsistent. For the most part this is a problem not unique to
> any table in system_distributed and otherwise handled, but how does the
> system_distributed keyspace handling of such failures impact repairs.
>
> Even with strong consistency, I would assume the design needs to be
> pessimistic, e.g. multiple node repairs can be started at the time. Is
> this true, if so how is it handled ?
>
> I am also curious as to how the impact of these tables changes as we
> address (1) and (2).
>
> (4)
> I can see how the CEP's design works well for the biggest clusters, and
> those with heterogeneous data-models (which often comes with larger
> deployment sets). But I don't think we can use this as the bar to quality
> or acceptance. Many smaller clusters that come with lots of keyspaces and
> tables have real troubles trying to get repairs to run weekly. We can't
> simply blame users for not having optimal data models and deployments.
>
> Carefully tuning the schedules of tables, and the cluster itself, is often
> a requirement – time-consuming and a real pain point. The CEP as it stands
> today I can, with confidence, say will simply not work for many users.
> Worse than that it will provide false hope, and take time and effort for
> users until they realise it won't work, leaving them having to revert to
> their previous solution. No one expects the CEP to initially handle and
> solve every situation, especially poor data-models and over-capacity
> clusters. Hope here is just a bit of discussion that can help us be
> informative about our limitations, and possibly save some users from
> thinking this is their silver bullet.
>
> The biggest aspect to this I believe is (1), but operational stability and
> tuning is also critical. Alex mentions the range-centric approach, which
> helps balance load, which in turn gives you more head room. But there's
> also stuff like parallel repairs, handling (dividing) repair_session_size,
> 256 vnode times 16 subranges on many empty tables saturating throughput,
> etc. I think most of these are minor and will fit into the design ok.
> WRT 256 vnode multiplied by 16 subranges running on tiny tables, I see the
> implementation of splitting repairs by partition count rather than token
> range as pretty crucial tbh. Also curious (maybe I missed it in the PR)
> how incremental repairs are getting token ranged, as this has a noticeable
> duplicating impact on the cost of anti-compaction.
>
> Also, is node load ever taken into account, e.g. avoiding starting
> repairs on nodes where too many pending compactions, hints being received,
> etc.
>
>