Hi Bryan,
I'd like to find out whether there is any way to tell when the data will
become consistent again in both cases.

If the node is down for less than max_hint_window_in_ms (say 2 hours out of
a 3-hour max), is there any way to check the logs, JMX, etc. to see whether
the hint queue has drained back to zero (or close to it)?
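
For context, the kind of checks I had in mind look roughly like this (just a
sketch; I'm not sure these are the right counters for our version, and the
JMX MBean names below are my guess from the metrics docs):

    # pending/completed hinted-handoff tasks on each live node
    nodetool tpstats | grep -i hint

    # read-repair counters, for a rough sense of background sync activity
    nodetool netstats

    # or via JMX (jconsole or similar), something like:
    #   org.apache.cassandra.metrics:type=Storage,name=TotalHintsInProgress
    #   org.apache.cassandra.metrics:type=Storage,name=TotalHints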


If a node goes down for longer than max_hint_window_in_ms (say 4 hours,
beyond our 3-hour max), we run a repair job. What is the correct nodetool
repair syntax to use?
In particular, what is the difference between -local and -dc? They both seem
to restrict repair to nodes within a single datacenter, but after a cross-DC
network outage we want to repair nodes across DCs, right?
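
For reference, the variants I'm comparing look roughly like this (a sketch
based on nodetool repair --help on our build; the keyspace and DC names are
just placeholders):

    # primary-range repair against replicas in all DCs
    # (what I assume we want after a cross-DC outage)
    nodetool repair -pr my_keyspace

    # -local / --in-local-dc: only talk to replicas in this node's own DC
    nodetool repair -pr -local my_keyspace

    # -dc / --in-dc: only talk to replicas in the named DC(s)
    nodetool repair -pr -dc DC1 my_keyspace

My current understanding is that leaving off both flags lets repair compare
replicas across all DCs, but please correct me if that's wrong.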

thanks



On Fri, Feb 26, 2016 at 3:38 PM, Bryan Cheng <br...@blockcypher.com> wrote:

> Hi Jimmy,
>
> If you sustain a long downtime, repair is almost always the way to go.
>
> It seems like you're asking to what extent a cluster is able to
> recover/resync a downed peer.
>
> A peer will not attempt to reacquire all the data it has missed while
> being down. Recovery happens in a few ways:
>
> 1) Hints: Assuming that there are enough peers to satisfy your quorum
> requirements on write, the live peers will queue up these operations for up
> to max_hint_window_in_ms (from cassandra.yaml). These hints will be
> delivered once the peer recovers.
> 2) Read repair: There is a probability that read repair will happen,
> meaning that a query will trigger data consistency checks and updates _on
> the query being performed_.
> 3) Repair.
>
> If a machine goes down for longer than max_hint_window_in_ms, AFAIK you
> _will_ have missing data. If you cannot tolerate this situation, you need
> to take a look at your tunable consistency and/or trigger a repair.
>
> On Thu, Feb 25, 2016 at 7:26 PM, Jimmy Lin <y2klyf+w...@gmail.com> wrote:
>
>> So far they are not long, just some config changes and a restart.
>> If it is a 2-hour downtime for whatever reason, is a repair a better
>> option than trying to figure out whether the replication sync has finished?
>>
>> On Thu, Feb 25, 2016 at 1:09 PM, daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>>> Hmm. What are your processes when a node comes back after "a long
>>> offline"? Long enough to take the node offline and do a repair? Run the
>>> risk of serving stale data? Parallel repairs? ???
>>>
>>> So, what sort of time frames are "a long time"?
>>>
>>>
>>>
>>> Daemeon C.M. Reiydelle
>>> USA (+1) 415.501.0198
>>> London (+44) (0) 20 8144 9872
>>>
>>> On Thu, Feb 25, 2016 at 11:36 AM, Jimmy Lin <y2k...@gmail.com> wrote:
>>>
>>>> hi all,
>>>>
>>>> what are the better ways to check the overall replication status of a
>>>> Cassandra cluster?
>>>>
>>>> Within a single DC, unless a node is down for a long time, most of the
>>>> time I feel it is pretty much a non-issue and things are replicated pretty
>>>> fast. But when a node comes back from a long offline period, is there a way
>>>> to check that the node has finished its data sync with the other nodes?
>>>>
>>>> Across DCs, we have frequent VPN outages (sometimes short, sometimes
>>>> long) between DCs. I would also like to know whether there is a way to see
>>>> how replication between DCs is catching up under these conditions.
>>>>
>>>> Also, if I understand correctly, the only guaranteed way to make sure the
>>>> data is synced is to run a complete repair job, is that correct? I am
>>>> trying to see if there is a way to "force a quick replication sync" between
>>>> DCs after a VPN outage.
>>>> Or maybe this is unnecessary, as Cassandra will catch up as fast as it
>>>> can, and there is nothing else we (the system admins) can do to make it
>>>> faster or better?
>>>>
>>>>
>>>>
>>>> Sent from my iPhone
>>>>
>>>
>>>
>>
>
