hi Paulo, that's right, I forgot there is another table that actually tracks the rest of the repair details. Thanks for the pointers, I will explore more with that info.
I am actually surprised there is not much documentation out there about these two tables, or other tools or utilities harvesting this data. Thanks!

On Thu, Feb 25, 2016 at 1:38 PM, Paulo Motta <pauloricard...@gmail.com> wrote:

> > how does it work when a repair job targets only the local DC vs. all
> > DCs? are there any columns or flags I can use to tell the difference?
> > or does it actually matter?
>
> You cannot easily find out from the parent_repair_history table whether
> a repair is local-only or multi-DC. I created
> https://issues.apache.org/jira/browse/CASSANDRA-11244 to add more
> information to that table. Since that table only has id as its primary
> key, you would need to do a full scan to perform checks on it, or keep
> track of the parent session id when submitting the repair and query by
> primary key.
>
> What you could probably do to verify that your nodes are repaired on
> time is to check, for each table:
>
> select * from repair_history where keyspace_name = 'ks' and
> columnfamily_name = 'cf' and id > minTimeuuid(now() - gc_grace_seconds/2);
>
> and then verify for each node that all of its ranges have been repaired
> in this period, and send an alert otherwise. You can find out a node's
> ranges by querying JMX via StorageServiceMBean.getRangeToEndpointMap.
>
> To make this task a bit simpler you could probably add a secondary
> index on the participants column of the repair_history table with:
>
> CREATE INDEX myindex ON system_distributed.repair_history (participants);
>
> and check each node's status individually with:
>
> select * from repair_history where keyspace_name = 'ks' and
> columnfamily_name = 'cf' and id > minTimeuuid(now() - gc_grace_seconds/2)
> AND participants CONTAINS 'node_IP';
>
> 2016-02-25 16:22 GMT-03:00 Jimmy Lin <y2k...@gmail.com>:
>
>> hi Paulo,
>>
>> one more follow up... :)
>>
>> I noticed these tables are supposed to be replicated to all nodes in
>> the cluster, and they are not per-node specific.
>>
>> how does it work when a repair job targets only the local DC vs. all DCs?
>> is there any column or flag I can use to tell the difference? or does
>> it actually matter?
>>
>> thanks
>>
>> Sent from my iPhone
>>
>> On Feb 25, 2016, at 10:37 AM, Paulo Motta <pauloricard...@gmail.com> wrote:
>>
>> > why will each repair job execution have 2 entries? I thought it
>> would be one entry, beginning with the started_at column filled, and
>> when it completes, the finished_at column filled.
>>
>> that's correct, I was mistaken!
>>
>> > Also, if my cluster has more than one keyspace, the way this table
>> is structured, it will have multiple entries, one for each
>> keyspace_name value, no? thanks
>>
>> right, because repair sessions in different keyspaces will have
>> different repair session ids.
>>
>> 2016-02-25 15:04 GMT-03:00 Jimmy Lin <y2k...@gmail.com>:
>>
>>> hi Paulo,
>>>
>>> follow up on the # of entries question...
>>>
>>> why will each repair job execution have 2 entries? I thought it would
>>> be one entry, beginning with the started_at column filled, and when it
>>> completes, the finished_at column filled.
>>>
>>> Also, if my cluster has more than one keyspace, the way this table is
>>> structured, it will have multiple entries, one for each keyspace_name
>>> value, no?
>>>
>>> thanks
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 25, 2016, at 5:48 AM, Paulo Motta <pauloricard...@gmail.com> wrote:
>>>
>>> Hello Jimmy,
>>>
>>> The parent_repair_history table keeps track of the start and finish
>>> information of a repair session. The other table, repair_history,
>>> keeps track of repair status as it progresses. So, you should first
>>> query the parent_repair_history table to check whether a repair
>>> started and finished, as well as its duration, and then inspect the
>>> repair_history table to troubleshoot more specific details of a given
>>> repair session.
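The recency check Paulo suggests earlier in the thread can be sketched in client code. A minimal, hypothetical Python sketch (names are illustrative; the cutoff timestamp is computed client-side and passed to minTimeuuid() as a literal, since you cannot rely on timestamp arithmetic inside CQL on older versions):

```python
from datetime import datetime, timedelta, timezone

def recent_repairs_query(keyspace, table, gc_grace_seconds):
    """Build the CQL query suggested in the thread: repair_history rows
    newer than gc_grace_seconds/2, with the cutoff computed client-side."""
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=gc_grace_seconds / 2)
    ts = cutoff.strftime('%Y-%m-%d %H:%M:%S%z')
    return ("SELECT * FROM system_distributed.repair_history "
            "WHERE keyspace_name = '{}' AND columnfamily_name = '{}' "
            "AND id > minTimeuuid('{}')".format(keyspace, table, ts))

# Default gc_grace_seconds is 864000 (10 days), so this checks the last 5 days.
print(recent_repairs_query('ks', 'cf', 864000))
```

The resulting string would then be run through whatever driver you use, per table you want to verify.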
>>>
>>> Answering your questions below:
>>>
>>> > Is every invocation of nodetool repair recorded as one entry in the
>>> parent_repair_history CF, regardless of whether it is across DCs, a
>>> local node repair, or uses other options?
>>>
>>> Actually two entries, one for start and one for finish.
>>>
>>> > A repair job is done only if the "finished" column contains a
>>> value? and a repair job is successful only if there is no value in
>>> exception_message or exception_stacktrace?
>>>
>>> correct
>>>
>>> > what is the purpose of the successful_ranges column? do I have to
>>> check that they all match requested_ranges to ensure a successful run?
>>>
>>> correct
>>>
>>> > Ultimately, how do I find out the overall repair health/status in a
>>> given cluster?
>>>
>>> Check that repair is being executed on all nodes within
>>> gc_grace_seconds, and tune that value or troubleshoot problems
>>> otherwise.
>>>
>>> > Scanning through parent_repair_history and making sure all the
>>> known keyspaces have had a good repair run in recent days?
>>>
>>> Sounds good.
>>>
>>> You can check https://issues.apache.org/jira/browse/CASSANDRA-5839
>>> for more information.
>>>
>>> 2016-02-25 3:13 GMT-03:00 Jimmy Lin <y2klyf+w...@gmail.com>:
>>>
>>>> hi all,
>>>>
>>>> few questions regarding how to read or digest the
>>>> system_distributed.parent_repair_history CF, which I am very
>>>> interested in using to find out our repair status...
>>>>
>>>> - Is every invocation of nodetool repair recorded as one entry in
>>>> the parent_repair_history CF, regardless of whether it is across
>>>> DCs, a local node repair, or uses other options?
>>>>
>>>> - A repair job is done only if the "finished" column contains a
>>>> value? and a repair job is successful only if there is no value in
>>>> exception_message or exception_stacktrace?
>>>> what is the purpose of the successful_ranges column? do I have to
>>>> check that they all match requested_ranges to ensure a successful run?
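The done/success criteria confirmed above can be folded into one small helper. A hypothetical sketch (a row is represented as a plain dict mirroring the parent_repair_history columns; not an official API):

```python
def repair_status(row):
    """Classify a parent_repair_history row using the criteria from the
    thread: finished_at set means the repair completed; an exception or a
    mismatch between successful_ranges and requested_ranges means it did
    not fully succeed."""
    if row.get('finished_at') is None:
        return 'running'
    if row.get('exception_message') or row.get('exception_stacktrace'):
        return 'failed'
    if set(row.get('successful_ranges') or ()) != set(row.get('requested_ranges') or ()):
        return 'partial'
    return 'success'

# Example: a finished repair that is missing one requested range.
row = {'finished_at': '2016-02-25 13:38:00',
       'exception_message': None, 'exception_stacktrace': None,
       'requested_ranges': {'(0,100]', '(100,200]'},
       'successful_ranges': {'(0,100]'}}
print(repair_status(row))  # -> partial
```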
>>>>
>>>> - Ultimately, how do I find out the overall repair health/status in
>>>> a given cluster? Scanning through parent_repair_history and making
>>>> sure all the known keyspaces have had a good repair run in recent
>>>> days?
>>>>
>>>> ---------------
>>>> CREATE TABLE system_distributed.parent_repair_history (
>>>>     parent_id timeuuid PRIMARY KEY,
>>>>     columnfamily_names set<text>,
>>>>     exception_message text,
>>>>     exception_stacktrace text,
>>>>     finished_at timestamp,
>>>>     keyspace_name text,
>>>>     requested_ranges set<text>,
>>>>     started_at timestamp,
>>>>     successful_ranges set<text>
>>>> )
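Putting the thread together, the cluster-wide health check discussed above could be sketched as a scan over parent_repair_history rows, alerting on keyspaces whose last fully successful repair is older than gc_grace_seconds. A sketch under the same assumptions (rows as plain dicts with the schema's column names):

```python
from datetime import datetime, timedelta

def stale_keyspaces(rows, gc_grace_seconds, now):
    """Return keyspaces whose most recent fully successful repair
    (finished, no exception, all requested ranges repaired) is older
    than gc_grace_seconds, or that have no successful repair at all."""
    latest = {}
    for r in rows:
        ks = r['keyspace_name']
        latest.setdefault(ks, None)
        ok = (r.get('finished_at') is not None
              and not r.get('exception_message')
              and set(r.get('successful_ranges') or ()) ==
                  set(r.get('requested_ranges') or ()))
        if ok and (latest[ks] is None or r['finished_at'] > latest[ks]):
            latest[ks] = r['finished_at']
    cutoff = now - timedelta(seconds=gc_grace_seconds)
    return sorted(ks for ks, t in latest.items() if t is None or t < cutoff)

now = datetime(2016, 2, 25, 13, 38)
rows = [
    {'keyspace_name': 'ks1', 'finished_at': now - timedelta(days=2),
     'requested_ranges': {'r1'}, 'successful_ranges': {'r1'}},
    {'keyspace_name': 'ks2', 'finished_at': now - timedelta(days=20),
     'requested_ranges': {'r1'}, 'successful_ranges': {'r1'}},
]
print(stale_keyspaces(rows, 864000, now))  # -> ['ks2'] with 10-day gc_grace
```

Since parent_id is the only primary key column, fetching these rows does require the full scan Paulo mentions, which is acceptable for a periodic monitoring job on a small table.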