>Repairs causing a node to OOM is not unusual.   I've been working with a
customer in this situation the past few weeks.  Getting fixes out, or
mitigating the problem, is not always as quick as one hopes (see my
previous comment about how the repair_session_size setting gets easily
clobbered today).  This situation would be much improved with table
priority and tracking added to the system_distributed table(s).
I agree we would need to tackle this OOM / JVM-crashing scenario
eventually. On the other hand, adding table-level tracking looks easy,
but perfecting it would take some effort: we would have to handle all the
corner-case scenarios, such as cleaning up the state metadata, and what
happens if, due to a race, a table is dropped but its metadata cannot be
removed. Extending the architecture is simple; making it bug-free and
robust is a bit more complex.
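
To make that corner case concrete, here is a rough sketch of the kind of
cleanup involved (the table and column names are hypothetical, purely for
illustration; they are not an existing schema):

    -- Hypothetical per-table repair metadata, keyed by node and table.
    -- When a table is dropped, its rows need to be removed; if the node
    -- fails between the DROP and this DELETE, the rows are orphaned and
    -- a background sweep has to reconcile the metadata against the live
    -- schema.
    DELETE FROM system_distributed.auto_repair_status_by_table
    WHERE host_id = ? AND table_name = ?;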

>Does this emergency list imply then not doing --partitioner-range ?
The emergency list is to prioritize a few nodes over others, but those
nodes will continue to honor the same repair configuration that has been
provided. The default configuration is to repair primary token ranges only.

>For per-table custom-priorities and tracking it sounds like adding a
clustering column.  So the number of records would go from ~number of nodes
in the cluster, to ~number of nodes multiplied by up to the number of
tables in the cluster.  We do see clusters too often with up to a thousand
tables, despite strong recommendations not to go over two hundred.  Do you
see here any concern ?
My initial thought is to add this as a CQL table property, something like
"repair_priority=0.0", with all tables having the same priority by
default. The user can then change the priority through ALTER, say, "ALTER
TABLE T1 WITH repair_priority=0.1", and T1 will be prioritized over the
other tables. Again, I need to give this more thought and have a short
discussion, either in a bi-weekly meeting or on a ticket, to ensure all
folks are on the same page. If we go with this approach, we do not need
to add any additional columns to the repair metadata tables, so the
design continues to remain lightweight.
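
As a rough illustration of the property approach (the property name and
values are assumptions taken from the idea above, not an existing
Cassandra option):

    -- Hypothetical: every table would default to the same priority,
    -- e.g. repair_priority = 0.0, so behavior is unchanged out of the
    -- box. Raising one table's priority tells the scheduler to repair
    -- it ahead of the other tables:
    ALTER TABLE T1 WITH repair_priority = 0.1;
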
For a moment, let's just assume we add a new clustering column to track
tables. The number of rows then becomes <total_nodes> * <total_tables>,
which is still not an issue. As I mentioned above, the bigger problem for
table-level tracking is not the architecture extension; handling all the
race conditions correctly is what makes it a bit complex.
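
For illustration only (the table and column names below are made up, not
the actual CEP-37 schema), the shape of such an extension might be:

    -- Hypothetical per-node repair status table in system_distributed,
    -- extended with a table_name clustering column for per-table
    -- tracking.
    CREATE TABLE system_distributed.auto_repair_status_by_table (
        host_id      uuid,       -- one partition per node
        table_name   text,       -- clustering column: one row per table
        last_repair  timestamp,  -- when this node last repaired it
        PRIMARY KEY (host_id, table_name)
    );
    -- Row count ~= <total_nodes> * <total_tables>; even 100 nodes x
    -- 1,000 tables is only 100,000 rows, which is small for a
    -- distributed system table.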

>Also, in what versions will we be able to introduce such improvements ? We
will be waiting until the next major release ?  Playing around with the
schema of system tables in release branches is not much fun.
There are a few priority items beyond the CEP-37 MVP scope that some of
us are working on:
1. Extend the disk-capacity check for full repair - Jaydeep
2. Make incremental repair more reliable with an unrepaired-size-based
token splitter - Andy T, Chris L
3. Add support for Preview Repair - Kristijonas
4. Start a new ML discussion to gauge consensus on whether repairs should
be backward/forward compatible between major versions in the future -
Andy T

On top of the above list, here is my recommendation (this is just an
initial thought and subject to change depending on how the community
members see it):

   - Nov-Dec: We can definitely prioritize the table-level priority
   feature, which would address many of these concerns - Jaydeep (I can
   take the lead on a short discussion followed by the implementation)
   - Nov-Feb: For table-level tracking, we can divide it into two parts:
      - (Part-1) Nov-Dec: A video-call discussion among a few of us to
      agree on the design - Jaydeep
      - (Part-2) Dec-Feb: Implement based on the above design - *TODO*


Jaydeep


On Tue, Oct 29, 2024 at 12:06 PM Mick Semb Wever <m...@apache.org> wrote:

>
> Jaydeep,
>   your replies address my main concerns, there's a few questions of
> curiosity as replies inline below…
>
>
>
>
>
>> >Without any per-table scheduling and history (IIUC)  a node would have
>> to restart the repairs for all keyspaces and tables.
>>
>> The above-mentioned quote should work fine and will make sure the bad
>> tables/keyspaces are skipped, allowing the good keyspaces/tables to proceed
>> on a node, as long as the Cassandra JVM itself is not crashing. If a JVM
>> keeps crashing, then it will restart all over again, but then fixing the
>> JVM crashing might be a more significant issue and does not happen
>> regularly, IMO.
>>
>
>
> Repairs causing a node to OOM is not unusual.   I've been working with a
> customer in this situation the past few weeks.  Getting fixes out, or
> mitigating the problem, is not always as quick as one hopes (see my
> previous comment about how the repair_session_size setting gets easily
> clobbered today).  This situation would be much improved with table
> priority and tracking added to the system_distributed table(s).
>
>
>
>> If an admin sets some nodes on a priority queue, those nodes will be
>> repaired over the scheduler's own list. If an admin tags some nodes on the
>> emergency list, then those nodes will repair immediately. Basically, an
>> admin tells the scheduler, "*Just do what I say instead of using your
>> list of nodes*".
>>
>
>
> Does this emergency list imply then not doing --partitioner-range ?
>
>
> >I am also curious as to how the impact of these tables changes as we
>> address (1) and (2).
>>
>> Quite a lot of (1) & (2) can be addressed by just adding a new CQL
>> property, which won't even touch these metadata tables. In case we need to,
>> depending on the design for (1) & (2), it can be either addressed by adding
>> new columns and/or adding a new metadata table.
>>
>
> For per-table custom-priorities and tracking it sounds like adding a
> clustering column.  So the number of records would go from ~number of nodes
> in the cluster, to ~number of nodes multiplied by up to the number of
> tables in the cluster.  We do see clusters too often with up to a thousand
> tables, despite strong recommendations not to go over two hundred.  Do you
> see here any concern ?
>
> Also, in what versions will we be able to introduce such improvements ? We
> will be waiting until the next major release ?  Playing around with the
> schema of system tables in release branches is not much fun.
>
>
>
