hi Paulo, that's right, I forgot there is another table that actually tracks the rest of the repair details. Thanks for the pointers, I will explore more with that info.
I am actually surprised there is not much documentation out there about these two tables, or other tools or utilities harvesting this data. Thanks!

On Thu, Feb 25, 2016 at 1:38 PM, Paulo Motta <pauloricard...@gmail.com> wrote:

> > how does it work when a repair job targets only the local DC vs. all
> > DCs? are there any columns or flags I can use to tell the difference?
> > or does it actually matter?
>
> You cannot easily find out from the parent_repair_history table whether
> a repair is local-only or multi-DC. I created
> https://issues.apache.org/jira/browse/CASSANDRA-11244 to add more
> information to that table. Since that table only has id as its primary
> key, you would need to do a full scan to perform checks on it, or keep
> track of the parent session id when submitting the repair and query by
> primary key.
>
> What you could probably do to verify that your nodes are repaired on
> time is to check, for each table:
>
> select * from repair_history where keyspace_name = 'ks' and
> columnfamily_name = 'cf' and id > minTimeuuid(now() - gc_grace_seconds/2);
>
> and then verify for each node that all of its ranges have been repaired
> in this period, and send an alert otherwise. You can find out a node's
> ranges by querying JMX via StorageServiceMBean.getRangeToEndpointMap.
>
> To make this task a bit simpler you could probably add a secondary
> index on the participants column of the repair_history table with:
>
> CREATE INDEX myindex ON system_distributed.repair_history (participants);
>
> and check each node's status individually with:
>
> select * from repair_history where keyspace_name = 'ks' and
> columnfamily_name = 'cf' and id > minTimeuuid(now() - gc_grace_seconds/2)
> AND participants CONTAINS 'node_IP';
>
> 2016-02-25 16:22 GMT-03:00 Jimmy Lin <y2k...@gmail.com>:
>
>> hi Paulo,
>>
>> one more follow up... :)
>>
>> I noticed these tables are supposed to be replicated to all nodes in
>> the cluster, and they are not per-node specific.
>>
>> how does it work when a repair job targets only the local DC vs. all DCs?
>> is there any column or flag I can use to tell the difference? or does
>> it actually matter?
>>
>> thanks
>>
>> Sent from my iPhone
>>
>> On Feb 25, 2016, at 10:37 AM, Paulo Motta <pauloricard...@gmail.com> wrote:
>>
>> > why will each repair job execution have 2 entries? I thought it
>> would be one entry, beginning with the started_at column filled, and
>> when it completes, the finished_at column filled.
>>
>> that's correct, I was mistaken!
>>
>> > Also, if my cluster has more than one keyspace, the way this table
>> is structured, it will have multiple entries, one for each
>> keyspace_name value, no? thanks
>>
>> right, because repair sessions in different keyspaces will have
>> different repair session ids.
>>
>> 2016-02-25 15:04 GMT-03:00 Jimmy Lin <y2k...@gmail.com>:
>>
>>> hi Paulo,
>>>
>>> follow up on the # of entries question...
>>>
>>> why will each repair job execution have 2 entries? I thought it would
>>> be one entry, beginning with the started_at column filled, and when it
>>> completes, the finished_at column filled.
>>>
>>> Also, if my cluster has more than one keyspace, the way this table is
>>> structured, it will have multiple entries, one for each keyspace_name
>>> value, no?
>>>
>>> thanks
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 25, 2016, at 5:48 AM, Paulo Motta <pauloricard...@gmail.com> wrote:
>>>
>>> Hello Jimmy,
>>>
>>> The parent_repair_history table keeps track of the start and finish
>>> information of a repair session. The other table, repair_history,
>>> keeps track of repair status as it progresses. So, you should first
>>> query the parent_repair_history table to check whether a repair
>>> started and finished, as well as its duration, and then inspect the
>>> repair_history table to troubleshoot more specific details of a given
>>> repair session.
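The recency check Paulo suggests earlier in the thread can be sketched in client code. A minimal, hypothetical Python sketch (names are illustrative; the cutoff timestamp is computed client-side and passed to minTimeuuid() as a literal, since you cannot rely on timestamp arithmetic inside CQL on older versions):

```python
from datetime import datetime, timedelta, timezone

def recent_repairs_query(keyspace, table, gc_grace_seconds):
    """Build the CQL query suggested in the thread: repair_history rows
    newer than gc_grace_seconds/2, with the cutoff computed client-side."""
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=gc_grace_seconds / 2)
    ts = cutoff.strftime('%Y-%m-%d %H:%M:%S%z')
    return ("SELECT * FROM system_distributed.repair_history "
            "WHERE keyspace_name = '{}' AND columnfamily_name = '{}' "
            "AND id > minTimeuuid('{}')".format(keyspace, table, ts))

# Default gc_grace_seconds is 864000 (10 days), so this checks the last 5 days.
print(recent_repairs_query('ks', 'cf', 864000))
```

The resulting string would then be run through whatever driver you use, per table you want to verify.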
>>>
>>> Answering your questions below:
>>>
>>> > Is every invocation of nodetool repair recorded as one entry in the
>>> parent_repair_history CF, regardless of whether it is across DCs, a
>>> local node repair, or uses other options?
>>>
>>> Actually two entries, one for start and one for finish.
>>>
>>> > A repair job is done only if the "finished" column contains a
>>> value? and a repair job is successful only if there is no value in
>>> exception_message or exception_stacktrace?
>>>
>>> correct
>>>
>>> > what is the purpose of the successful_ranges column? do I have to
>>> check that they all match requested_ranges to ensure a successful run?
>>>
>>> correct
>>>
>>> > Ultimately, how do I find out the overall repair health/status in a
>>> given cluster?
>>>
>>> Check that repair is being executed on all nodes within
>>> gc_grace_seconds, and tune that value or troubleshoot problems
>>> otherwise.
>>>
>>> > Scanning through parent_repair_history and making sure all the
>>> known keyspaces have had a good repair run in recent days?
>>>
>>> Sounds good.
>>>
>>> You can check https://issues.apache.org/jira/browse/CASSANDRA-5839
>>> for more information.
>>>
>>> 2016-02-25 3:13 GMT-03:00 Jimmy Lin <y2klyf+w...@gmail.com>:
>>>
>>>> hi all,
>>>>
>>>> few questions regarding how to read or digest the
>>>> system_distributed.parent_repair_history CF, which I am very
>>>> interested in using to find out our repair status...
>>>>
>>>> - Is every invocation of nodetool repair recorded as one entry in
>>>> the parent_repair_history CF, regardless of whether it is across
>>>> DCs, a local node repair, or uses other options?
>>>>
>>>> - A repair job is done only if the "finished" column contains a
>>>> value? and a repair job is successful only if there is no value in
>>>> exception_message or exception_stacktrace?
>>>> what is the purpose of the successful_ranges column? do I have to
>>>> check that they all match requested_ranges to ensure a successful run?
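The done/success criteria confirmed above can be folded into one small helper. A hypothetical sketch (a row is represented as a plain dict mirroring the parent_repair_history columns; not an official API):

```python
def repair_status(row):
    """Classify a parent_repair_history row using the criteria from the
    thread: finished_at set means the repair completed; an exception or a
    mismatch between successful_ranges and requested_ranges means it did
    not fully succeed."""
    if row.get('finished_at') is None:
        return 'running'
    if row.get('exception_message') or row.get('exception_stacktrace'):
        return 'failed'
    if set(row.get('successful_ranges') or ()) != set(row.get('requested_ranges') or ()):
        return 'partial'
    return 'success'

# Example: a finished repair that is missing one requested range.
row = {'finished_at': '2016-02-25 13:38:00',
       'exception_message': None, 'exception_stacktrace': None,
       'requested_ranges': {'(0,100]', '(100,200]'},
       'successful_ranges': {'(0,100]'}}
print(repair_status(row))  # -> partial
```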
>>>>
>>>> - Ultimately, how do I find out the overall repair health/status in
>>>> a given cluster? Scanning through parent_repair_history and making
>>>> sure all the known keyspaces have had a good repair run in recent
>>>> days?
>>>>
>>>> ---------------
>>>> CREATE TABLE system_distributed.parent_repair_history (
>>>>     parent_id timeuuid PRIMARY KEY,
>>>>     columnfamily_names set<text>,
>>>>     exception_message text,
>>>>     exception_stacktrace text,
>>>>     finished_at timestamp,
>>>>     keyspace_name text,
>>>>     requested_ranges set<text>,
>>>>     started_at timestamp,
>>>>     successful_ranges set<text>
>>>> )
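Putting the thread together, the cluster-wide health check discussed above could be sketched as a scan over parent_repair_history rows, alerting on keyspaces whose last fully successful repair is older than gc_grace_seconds. A sketch under the same assumptions (rows as plain dicts with the schema's column names):

```python
from datetime import datetime, timedelta

def stale_keyspaces(rows, gc_grace_seconds, now):
    """Return keyspaces whose most recent fully successful repair
    (finished, no exception, all requested ranges repaired) is older
    than gc_grace_seconds, or that have no successful repair at all."""
    latest = {}
    for r in rows:
        ks = r['keyspace_name']
        latest.setdefault(ks, None)
        ok = (r.get('finished_at') is not None
              and not r.get('exception_message')
              and set(r.get('successful_ranges') or ()) ==
                  set(r.get('requested_ranges') or ()))
        if ok and (latest[ks] is None or r['finished_at'] > latest[ks]):
            latest[ks] = r['finished_at']
    cutoff = now - timedelta(seconds=gc_grace_seconds)
    return sorted(ks for ks, t in latest.items() if t is None or t < cutoff)

now = datetime(2016, 2, 25, 13, 38)
rows = [
    {'keyspace_name': 'ks1', 'finished_at': now - timedelta(days=2),
     'requested_ranges': {'r1'}, 'successful_ranges': {'r1'}},
    {'keyspace_name': 'ks2', 'finished_at': now - timedelta(days=20),
     'requested_ranges': {'r1'}, 'successful_ranges': {'r1'}},
]
print(stale_keyspaces(rows, 864000, now))  # -> ['ks2'] with 10-day gc_grace
```

Since parent_id is the only primary key column, fetching these rows does require the full scan Paulo mentions, which is acceptable for a periodic monitoring job on a small table.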