data loss in different DC

2017-09-28 Thread Peng Xiao
Dear All,


We have a cluster with one DC (DC1, RF=3) and another DC (DC2, RF=1) used only for ETL. We found that sometimes we can query records in DC1 but are not able to find the same records in DC2 with LOCAL_QUORUM. How does this happen?
Could anyone please advise?
It looks like we can only run repair to fix it.
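(For reference, the replication layout described above corresponds to a keyspace definition along the following lines. This is only a minimal sketch using the Python cassandra-driver; the keyspace name and contact point are illustrative, not taken from the cluster in question.)

```python
from cassandra.cluster import Cluster

# Connect through any reachable node (address is illustrative).
cluster = Cluster(["10.0.0.1"])
session = cluster.connect()

# DC1 keeps three replicas of each row; DC2 keeps a single replica for ETL.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS etl_ks
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'DC1': 3,
        'DC2': 1
    }
""")

cluster.shutdown()
```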


Thanks,
Peng Xiao

Re: data loss in different DC

2017-09-28 Thread DuyHai Doan
If you're writing into DC1 with CL = LOCAL_xxx, there is no guarantee that you
will read the same data back in DC2. Only repair will help you

On Thu, Sep 28, 2017 at 11:41 AM, Peng Xiao <2535...@qq.com> wrote:

> Dear All,
>
> We have a cluster with one DC1:RF=3,another DC DC2:RF=1 only for ETL,but
> we found that sometimes we can query records in DC1,while not able not find
> the same record in DC2 with local_quorum.How it happens?
> Could anyone please advise?
> looks we can only run repair to fix it.
>
> Thanks,
> Peng Xiao
>


data loss in different DC

2017-09-28 Thread Peng Xiao
Dear All,


We have a cluster with one DC (DC1, RF=3) and another DC (DC2, RF=1), with DC2 used only for ETL. We found that sometimes we can query records in DC1 but are not able to find the same records in DC2 with LOCAL_QUORUM. How does this happen? It looks like data loss in DC2.
Could anyone please advise?
It looks like we can only run repair to fix it.


Thanks,
Peng Xiao

Re: data loss in different DC

2017-09-28 Thread Peng Xiao
very sorry for the duplicate mail.




-- Original Message --
From: <2535...@qq.com>
Date: Thu, Sep 28, 2017 07:41 PM
To: "user";
Subject: data loss in different DC



Dear All,


We have a cluster with one DC (DC1, RF=3) and another DC (DC2, RF=1), with DC2 used only for ETL. We found that sometimes we can query records in DC1 but are not able to find the same records in DC2 with LOCAL_QUORUM. How does this happen? It looks like data loss in DC2.
Could anyone please advise?
It looks like we can only run repair to fix it.


Thanks,
Peng Xiao

Re: data loss in different DC

2017-09-28 Thread Peng Xiao
Even with CL=QUORUM, there is no guarantee that we will read the same data in
DC2, right?
Then multiple DCs seem to make no sense?




-- Original Message --
From: "DuyHai Doan";
Sent: Thursday, September 28, 2017, 5:45 PM
To: "user";
Subject: Re: data loss in different DC



If you're writing into DC1 with CL = LOCAL_xxx, there is no guarantee to be 
sure to read the same data in DC2. Only repair will help you

On Thu, Sep 28, 2017 at 11:41 AM, Peng Xiao <2535...@qq.com> wrote:
Dear All,


We have a cluster with one DC1:RF=3,another DC DC2:RF=1 only for ETL,but we 
found that sometimes we can query records in DC1,while not able not find the 
same record in DC2 with local_quorum.How it happens?
Could anyone please advise?
looks we can only run repair to fix it.


Thanks,
Peng Xiao

Re: data loss in different DC

2017-09-28 Thread Reynald Bourtembourg

Hi,

You can write with CL=EACH_QUORUM and read with CL=LOCAL_QUORUM to get 
strong consistency.


Kind regards,
Reynald

On 28/09/2017 13:46, Peng Xiao wrote:
even with CL=QUORUM,there is no guarantee to be sure to read the same 
data in DC2,right?

then multi DCs looks make no sense?


-- Original Message --
*From:* "DuyHai Doan";
*Sent:* Thursday, September 28, 2017, 5:45 PM
*To:* "user";
*Subject:* Re: data loss in different DC

If you're writing into DC1 with CL = LOCAL_xxx, there is no guarantee 
to be sure to read the same data in DC2. Only repair will help you


On Thu, Sep 28, 2017 at 11:41 AM, Peng Xiao <2535...@qq.com 
> wrote:


Dear All,

We have a cluster with one DC1:RF=3,another DC DC2:RF=1 only for
ETL,but we found that sometimes we can query records in DC1,while
not able not find the same record in DC2 with local_quorum.How it
happens?
Could anyone please advise?
looks we can only run repair to fix it.

Thanks,
Peng Xiao






Re: data loss in different DC

2017-09-28 Thread Jacob Shadix
How often are you running repairs?

-- Jacob Shadix

On Thu, Sep 28, 2017 at 7:53 AM, Reynald Bourtembourg <
reynald.bourtembo...@esrf.fr> wrote:

> Hi,
>
> You can write with CL=EACH_QUORUM and read with CL=LOCAL_QUORUM to get
> strong consistency.
>
> Kind regards,
> Reynald
>
>
> On 28/09/2017 13:46, Peng Xiao wrote:
>
> even with CL=QUORUM,there is no guarantee to be sure to read the same data
> in DC2,right?
> then multi DCs looks make no sense?
>
>
> -- Original Message --
> *From:* "DuyHai Doan";
> *Sent:* Thursday, September 28, 2017, 5:45 PM
> *To:* "user";
> *Subject:* Re: data loss in different DC
>
> If you're writing into DC1 with CL = LOCAL_xxx, there is no guarantee to
> be sure to read the same data in DC2. Only repair will help you
>
> On Thu, Sep 28, 2017 at 11:41 AM, Peng Xiao <2535...@qq.com> wrote:
>
>> Dear All,
>>
>> We have a cluster with one DC1:RF=3,another DC DC2:RF=1 only for ETL,but
>> we found that sometimes we can query records in DC1,while not able not find
>> the same record in DC2 with local_quorum.How it happens?
>> Could anyone please advise?
>> looks we can only run repair to fix it.
>>
>> Thanks,
>> Peng Xiao
>>
>
>
>


Re: data loss in different DC

2017-09-28 Thread Jeff Jirsa
Your quorum writes are only guaranteed to be on half+1 replicas - there’s no
guarantee which nodes those will be. For strong consistency with multiple DCs,
you can either:

- write at quorum and read at quorum from any dc, or
- write each_quorum and read local_quorum from any dc, or
- write at local_quorum and read local_quorum from the same DC only
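
A minimal sketch of the second option (EACH_QUORUM writes, LOCAL_QUORUM reads) with the Python cassandra-driver; the keyspace, table, and contact point are illustrative only:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])       # any reachable node; address is illustrative
session = cluster.connect("etl_ks")   # keyspace name is illustrative

# The write only succeeds once a quorum of replicas in *every* DC has acknowledged it.
insert = SimpleStatement(
    "INSERT INTO records (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.EACH_QUORUM,
)
session.execute(insert, (42, "some payload"))

# A LOCAL_QUORUM read in either DC is then guaranteed to see that write.
select = SimpleStatement(
    "SELECT payload FROM records WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
row = session.execute(select, (42,)).one()
print(row.payload if row else "not found")

cluster.shutdown()
```

The trade-off is that EACH_QUORUM writes fail whenever any remote DC is unreachable, which matters here since DC2 holds only a single replica.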




-- 
Jeff Jirsa


> On Sep 28, 2017, at 2:41 AM, Peng Xiao <2535...@qq.com> wrote:
> 
> Dear All,
> 
> We have a cluster with one DC1:RF=3,another DC DC2:RF=1 only for ETL,but we 
> found that sometimes we can query records in DC1,while not able not find the 
> same record in DC2 with local_quorum.How it happens?
> Could anyone please advise?
> looks we can only run repair to fix it.
> 
> Thanks,
> Peng Xiao




[no subject]

2017-09-28 Thread Dan Kinder
Hi,

I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
following. The cluster does function, for a while, but then some stages
begin to back up and the node does not recover and does not drain the
tasks, even under no load. This happens both to MutationStage and
GossipStage.

I do see the following exception happen in the logs:


ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
CassandraDaemon.java:228 - Exception in thread
Thread[ReadRepairStage:2328,5,main]

org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
received only 1 responses.

at
org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
~[na:1.8.0_91]

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
~[na:1.8.0_91]

at
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
~[apache-cassandra-3.11.0.jar:3.11.0]

at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]


But it's hard to correlate precisely with things going bad. It is also very
strange to me since I have both read_repair_chance and
dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
confusing why ReadRepairStage would err.
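
(For reference, those per-table settings can be read back from system_schema; below is a minimal sketch with the Python cassandra-driver, where the contact point and keyspace name are illustrative.)

```python
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])  # contact point is illustrative
session = cluster.connect()

# List the read-repair settings for every table in a keyspace
# (keyspace name is illustrative).
rows = session.execute(
    "SELECT table_name, read_repair_chance, dclocal_read_repair_chance "
    "FROM system_schema.tables WHERE keyspace_name = %s",
    ("my_keyspace",),
)
for row in rows:
    print(row.table_name, row.read_repair_chance, row.dclocal_read_repair_chance)

cluster.shutdown()
```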

Anyone have thoughts on this? It's pretty muddling, and causes nodes to
lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
If I can't find a resolution I'm going to need to downgrade and restore to
backup...

The only issue I found that looked similar is
https://issues.apache.org/jira/browse/CASSANDRA-12689 but that appears to
be fixed by 3.10.


$ nodetool tpstats

Pool Name Active   Pending  Completed   Blocked
All time blocked

ReadStage  0 0 582103 0
  0

MiscStage  0 0  0 0
  0

CompactionExecutor1111   2868 0
  0

MutationStage 32   4593678   55057393 0
  0

GossipStage1  2818 371487 0
  0

RequestResponseStage   0 04345522 0
  0

ReadRepairStage0 0 151473 0
  0

CounterMutationStage   0 0  0 0
  0

MemtableFlushWriter181 76 0
  0

MemtablePostFlush  1   382139 0
  0

ValidationExecutor 0 0  0 0
  0

ViewMutationStage  0 0  0 0
  0

CacheCleanupExecutor   0 0  0 0
  0

PerDiskMemtableFlushWriter_10  0 0 69 0
  0

PerDiskMemtableFlushWriter_11  0 0 69 0
  0

MemtableReclaimMemory  0 0 81 0
  0

PendingRangeCalculator 0 0 32 0
  0

SecondaryIndexManagement   0 0  0 0
  0

HintsDispatcher0 0596 0
  0

PerDiskMemtableFlushWriter_1   0 0 69 0
  0

Native-Transport-Requests 11 04547746 0
  67

PerDiskMemtableFlushWriter_2   0 0 69 0
  0

MigrationStage 1  1545586 0
  0

PerDiskMemtableFlushWriter_0   0 0 80 0
  0

Sampler0 0  0 0
  0

PerDiskMemtableFlushWriter_5   0 0 69 0
  

Re:

2017-09-28 Thread Dan Kinder
I should also note that I also see nodes become locked up without seeing that
exception. But the GossipStage buildup does seem correlated with gossip
activity, e.g. me restarting a different node.

On Thu, Sep 28, 2017 at 9:20 AM, Dan Kinder  wrote:

> Hi,
>
> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
> following. The cluster does function, for a while, but then some stages
> begin to back up and the node does not recover and does not drain the
> tasks, even under no load. This happens both to MutationStage and
> GossipStage.
>
> I do see the following exception happen in the logs:
>
>
> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
> CassandraDaemon.java:228 - Exception in thread
> Thread[ReadRepairStage:2328,5,main]
>
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out
> - received only 1 responses.
>
> at org.apache.cassandra.service.DataResolver$
> RepairMergeListener.close(DataResolver.java:171)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at org.apache.cassandra.db.partitions.
> UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ~[na:1.8.0_91]
>
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ~[na:1.8.0_91]
>
> at org.apache.cassandra.concurrent.NamedThreadFactory.
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>
>
> But it's hard to correlate precisely with things going bad. It is also
> very strange to me since I have both read_repair_chance and
> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
> confusing why ReadRepairStage would err.
>
> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
> If I can't find a resolution I'm going to need to downgrade and restore to
> backup...
>
> The only issue I found that looked similar is https://issues.apache.org/
> jira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10.
>
>
> $ nodetool tpstats
>
> Pool Name Active   Pending  Completed
> Blocked  All time blocked
>
> ReadStage  0 0 582103 0
> 0
>
> MiscStage  0 0  0 0
> 0
>
> CompactionExecutor1111   2868 0
> 0
>
> MutationStage 32   4593678   55057393 0
> 0
>
> GossipStage1  2818 371487 0
> 0
>
> RequestResponseStage   0 04345522 0
> 0
>
> ReadRepairStage0 0 151473 0
> 0
>
> CounterMutationStage   0 0  0 0
> 0
>
> MemtableFlushWriter181 76 0
> 0
>
> MemtablePostFlush  1   382139 0
> 0
>
> ValidationExecutor 0 0  0 0
> 0
>
> ViewMutationStage  0 0  0 0
> 0
>
> CacheCleanupExecutor   0 0  0 0
> 0
>
> PerDiskMemtableFlushWriter_10  0 0 69 0
> 0
>
> PerDiskMemtableFlushWriter_11  0 0 69 0
> 0
>
> MemtableReclaimMemory  0 0 81 0
> 0
>
> PendingRangeCalculator 0 0 32 0
> 0
>
> SecondaryIndexManagement   0 0  0 0
> 0
>
> HintsDispatcher0 0596 0
> 0
>
> PerDiskMemtableFlushWriter_1   0 0 69 0
> 0
>
> Native-Transport-Requests

Re: data loss in different DC

2017-09-28 Thread Reynald Bourtembourg
Do you mean how often we should run repairs in this situation (write at
CL=EACH_QUORUM and read at CL=LOCAL_QUORUM)?


I don't consider myself an expert in this domain, so I will let the
real experts answer this question...
You can also refer to this mailing list's history (many threads on this
subject) and to the recommendations in the documentation:

http://cassandra.apache.org/doc/latest/operating/repair.html#usage-and-best-practices

Kind regards
Reynald


On 28/09/2017 14:49, Jacob Shadix wrote:

How often are you running repairs?

-- Jacob Shadix

On Thu, Sep 28, 2017 at 7:53 AM, Reynald Bourtembourg 
mailto:reynald.bourtembo...@esrf.fr>> 
wrote:


Hi,

You can write with CL=EACH_QUORUM and read with CL=LOCAL_QUORUM to
get strong consistency.

Kind regards,
Reynald


On 28/09/2017 13:46, Peng Xiao wrote:

even with CL=QUORUM,there is no guarantee to be sure to read the
same data in DC2,right?
then multi DCs looks make no sense?


-- Original Message --
*From:* "DuyHai Doan";
*Sent:* Thursday, September 28, 2017, 5:45 PM
*To:* "user";
*Subject:* Re: data loss in different DC

If you're writing into DC1 with CL = LOCAL_xxx, there is no
guarantee to be sure to read the same data in DC2. Only repair
will help you

On Thu, Sep 28, 2017 at 11:41 AM, Peng Xiao <2535...@qq.com
> wrote:

Dear All,

We have a cluster with one DC1:RF=3,another DC DC2:RF=1 only
for ETL,but we found that sometimes we can query records in
DC1,while not able not find the same record in DC2 with
local_quorum.How it happens?
Could anyone please advise?
looks we can only run repair to fix it.

Thanks,
Peng Xiao









Re:

2017-09-28 Thread Prem Yadav
Dan,
As part of the upgrade, did you upgrade the sstables?

Sent from mobile. Please excuse typos

On 28 Sep 2017 17:45, "Dan Kinder"  wrote:

> I should also note, I also see nodes become locked up without seeing that
> Exception. But the GossipStage buildup does seem correlated with gossip
> activity, e.g. me restarting a different node.
>
> On Thu, Sep 28, 2017 at 9:20 AM, Dan Kinder  wrote:
>
>> Hi,
>>
>> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
>> following. The cluster does function, for a while, but then some stages
>> begin to back up and the node does not recover and does not drain the
>> tasks, even under no load. This happens both to MutationStage and
>> GossipStage.
>>
>> I do see the following exception happen in the logs:
>>
>>
>> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
>> CassandraDaemon.java:228 - Exception in thread
>> Thread[ReadRepairStage:2328,5,main]
>>
>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
>> out - received only 1 responses.
>>
>> at 
>> org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at org.apache.cassandra.db.partitions.UnfilteredPartitionIterat
>> ors$2.close(UnfilteredPartitionIterators.java:182)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThr
>> ow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> ~[na:1.8.0_91]
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> ~[na:1.8.0_91]
>>
>> at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$
>> threadLocalDeallocator$0(NamedThreadFactory.java:81)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>>
>>
>> But it's hard to correlate precisely with things going bad. It is also
>> very strange to me since I have both read_repair_chance and
>> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
>> confusing why ReadRepairStage would err.
>>
>> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
>> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
>> If I can't find a resolution I'm going to need to downgrade and restore to
>> backup...
>>
>> The only issue I found that looked similar is https://issues.apache.org/j
>> ira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10.
>>
>>
>> $ nodetool tpstats
>>
>> Pool Name Active   Pending  Completed
>> Blocked  All time blocked
>>
>> ReadStage  0 0 582103
>>   0 0
>>
>> MiscStage  0 0  0
>>   0 0
>>
>> CompactionExecutor1111   2868
>>   0 0
>>
>> MutationStage 32   4593678   55057393
>>   0 0
>>
>> GossipStage1  2818 371487
>>   0 0
>>
>> RequestResponseStage   0 04345522
>>   0 0
>>
>> ReadRepairStage0 0 151473
>>   0 0
>>
>> CounterMutationStage   0 0  0
>>   0 0
>>
>> MemtableFlushWriter181 76
>>   0 0
>>
>> MemtablePostFlush  1   382139
>>   0 0
>>
>> ValidationExecutor 0 0  0
>>   0 0
>>
>> ViewMutationStage  0 0  0
>>   0 0
>>
>> CacheCleanupExecutor   0 0  0
>>   0 0
>>
>> PerDiskMemtableFlushWriter_10  0 0 69
>>   0 0
>>
>> PerDiskMemtableFlushWriter_11  0 0 69
>>   0 0
>>
>> MemtableReclaimMemory  0 0 81
>>   0 0
>>
>> PendingRangeCalculator 0 0 32
>>   0 0
>>
>> SecondaryIndexManagement   0 0  0
>>   0 0
>>
>> HintsDispatcher0 0  

Nodetool repair -pr

2017-09-28 Thread Dmitry Buzolin
Hi All,

Can someone confirm whether

"nodetool repair -pr -j2" also runs with -inc? I see the docs mention that -inc
is set by default, but I am not sure whether it is enabled when the -pr option is used.

Thanks!



2.1.19 and 2.2.11 releases up for vote

2017-09-28 Thread Michael Shuler
If you don't read the dev@ list, I just built the 2.1.19 and 2.2.11
releases to vote on. Testing is always appreciated, so thought I'd let
the user@ list know, if you want to give them a whirl prior to release.

https://lists.apache.org/thread.html/7ebafc7e6228e24a9ee4e7372e8e71ef9ec868bf801eb5f31c1c20c3@%3Cdev.cassandra.apache.org%3E
https://lists.apache.org/thread.html/3f1c55ded03cd21348231231072e28436a25cc8a97e4740837885c76@%3Cdev.cassandra.apache.org%3E

-- 
Kind regards,
Michael




Re:

2017-09-28 Thread Jeff Jirsa
That read timeout looks like a nonissue (though we should catch it and
squash it differently). MigrationStage is backed up as well. Are you still
bouncing nodes? Have you fully upgraded the cluster at this point?


On Thu, Sep 28, 2017 at 9:44 AM, Dan Kinder  wrote:

> I should also note, I also see nodes become locked up without seeing that
> Exception. But the GossipStage buildup does seem correlated with gossip
> activity, e.g. me restarting a different node.
>
> On Thu, Sep 28, 2017 at 9:20 AM, Dan Kinder  wrote:
>
>> Hi,
>>
>> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
>> following. The cluster does function, for a while, but then some stages
>> begin to back up and the node does not recover and does not drain the
>> tasks, even under no load. This happens both to MutationStage and
>> GossipStage.
>>
>> I do see the following exception happen in the logs:
>>
>>
>> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
>> CassandraDaemon.java:228 - Exception in thread
>> Thread[ReadRepairStage:2328,5,main]
>>
>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
>> out - received only 1 responses.
>>
>> at 
>> org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at org.apache.cassandra.db.partitions.UnfilteredPartitionIterat
>> ors$2.close(UnfilteredPartitionIterators.java:182)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThr
>> ow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> ~[na:1.8.0_91]
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> ~[na:1.8.0_91]
>>
>> at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$
>> threadLocalDeallocator$0(NamedThreadFactory.java:81)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>>
>>
>> But it's hard to correlate precisely with things going bad. It is also
>> very strange to me since I have both read_repair_chance and
>> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
>> confusing why ReadRepairStage would err.
>>
>> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
>> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
>> If I can't find a resolution I'm going to need to downgrade and restore to
>> backup...
>>
>> The only issue I found that looked similar is https://issues.apache.org/j
>> ira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10.
>>
>>
>> $ nodetool tpstats
>>
>> Pool Name Active   Pending  Completed
>> Blocked  All time blocked
>>
>> ReadStage  0 0 582103
>>   0 0
>>
>> MiscStage  0 0  0
>>   0 0
>>
>> CompactionExecutor1111   2868
>>   0 0
>>
>> MutationStage 32   4593678   55057393
>>   0 0
>>
>> GossipStage1  2818 371487
>>   0 0
>>
>> RequestResponseStage   0 04345522
>>   0 0
>>
>> ReadRepairStage0 0 151473
>>   0 0
>>
>> CounterMutationStage   0 0  0
>>   0 0
>>
>> MemtableFlushWriter181 76
>>   0 0
>>
>> MemtablePostFlush  1   382139
>>   0 0
>>
>> ValidationExecutor 0 0  0
>>   0 0
>>
>> ViewMutationStage  0 0  0
>>   0 0
>>
>> CacheCleanupExecutor   0 0  0
>>   0 0
>>
>> PerDiskMemtableFlushWriter_10  0 0 69
>>   0 0
>>
>> PerDiskMemtableFlushWriter_11  0 0 69
>>   0 0
>>
>> MemtableReclaimMemory  0 0 81
>>   0 0
>>
>> PendingRangeCalculator 0 0 32
>>   0 0
>>
>> SecondaryIndexManagement

RE:

2017-09-28 Thread Steinmaurer, Thomas
Dan,

do you see any major GC? We have been hit by the following memory leak in our 
loadtest environment with 3.11.0.
https://issues.apache.org/jira/browse/CASSANDRA-13754

So, depending on the heap size and uptime, you might get into heap troubles.

Thomas

From: Dan Kinder [mailto:dkin...@turnitin.com]
Sent: Thursday, September 28, 2017 18:20
To: user@cassandra.apache.org
Subject:


Hi,

I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the 
following. The cluster does function, for a while, but then some stages begin 
to back up and the node does not recover and does not drain the tasks, even 
under no load. This happens both to MutationStage and GossipStage.

I do see the following exception happen in the logs:



ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 CassandraDaemon.java:228 - 
Exception in thread Thread[ReadRepairStage:2328,5,main]

org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - 
received only 1 responses.

at 
org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171)
 ~[apache-cassandra-3.11.0.jar:3.11.0]

at 
org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182)
 ~[apache-cassandra-3.11.0.jar:3.11.0]

at 
org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) 
~[apache-cassandra-3.11.0.jar:3.11.0]

at 
org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
 ~[apache-cassandra-3.11.0.jar:3.11.0]

at 
org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50)
 ~[apache-cassandra-3.11.0.jar:3.11.0]

at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
~[apache-cassandra-3.11.0.jar:3.11.0]

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
~[na:1.8.0_91]

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
~[na:1.8.0_91]

at 
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
 ~[apache-cassandra-3.11.0.jar:3.11.0]

at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]



But it's hard to correlate precisely with things going bad. It is also very 
strange to me since I have both read_repair_chance and 
dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is confusing 
why ReadRepairStage would err.

Anyone have thoughts on this? It's pretty muddling, and causes nodes to lock 
up. Once it happens Cassandra can't even shut down, I have to kill -9. If I 
can't find a resolution I'm going to need to downgrade and restore to backup...

The only issue I found that looked similar is 
https://issues.apache.org/jira/browse/CASSANDRA-12689 but that appears to be 
fixed by 3.10.



$ nodetool tpstats

Pool Name Active   Pending  Completed   Blocked  
All time blocked

ReadStage  0 0 582103 0 
0

MiscStage  0 0  0 0 
0

CompactionExecutor1111   2868 0 
0

MutationStage 32   4593678   55057393 0 
0

GossipStage1  2818 371487 0 
0

RequestResponseStage   0 04345522 0 
0

ReadRepairStage0 0 151473 0 
0

CounterMutationStage   0 0  0 0 
0

MemtableFlushWriter181 76 0 
0

MemtablePostFlush  1   382139 0 
0

ValidationExecutor 0 0  0 0 
0

ViewMutationStage  0 0  0 0 
0

CacheCleanupExecutor   0 0  0 0 
0

PerDiskMemtableFlushWriter_10  0 0 69 0 
0

PerDiskMemtableFlushWriter_11  0 0 69 0 
0

MemtableReclaimMemory  0 0 81 0 
0

PendingRangeCalculator 0 0 32 0 
0

SecondaryIndexManagement   0 0  0 0 
0

HintsDispatcher0 0596 0 
0

PerDiskMemtableFlushWriter_1   0 0 69 0 
0

Native-Transport-Requests 11 04547746 

Re:

2017-09-28 Thread Dan Kinder
Thanks for the responses.

@Prem yes this is after the entire cluster is on 3.11, but no I did not run
upgradesstables yet.

@Thomas no I don't see any major GC going on.

@Jeff yeah it's fully upgraded. I decided to shut the whole thing down and
bring it back (thankfully this cluster is not serving live traffic). The
nodes seemed okay for an hour or two, but I see the issue again, without me
bouncing any nodes. This time it's ReadStage that's building up, and the
exception I'm seeing in the logs is:

DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242 -
Digest mismatch:

org.apache.cassandra.service.DigestMismatchException: Mismatch for key
DecoratedKey(6150926370328526396, 696a6374652e6f7267)
(2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)

at
org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_71]

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_71]

at
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
[apache-cassandra-3.11.0.jar:3.11.0]

at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71]


Do you think running upgradesstables would help? Or relocatesstables? I
presumed it shouldn't be necessary for Cassandra to function, just an
optimization.

On Thu, Sep 28, 2017 at 12:49 PM, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Dan,
>
>
>
> do you see any major GC? We have been hit by the following memory leak in
> our loadtest environment with 3.11.0.
>
> https://issues.apache.org/jira/browse/CASSANDRA-13754
>
>
>
> So, depending on the heap size and uptime, you might get into heap
> troubles.
>
>
>
> Thomas
>
>
>
> *From:* Dan Kinder [mailto:dkin...@turnitin.com]
> *Sent:* Donnerstag, 28. September 2017 18:20
> *To:* user@cassandra.apache.org
> *Subject:*
>
>
>
> Hi,
>
> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
> following. The cluster does function, for a while, but then some stages
> begin to back up and the node does not recover and does not drain the
> tasks, even under no load. This happens both to MutationStage and
> GossipStage.
>
> I do see the following exception happen in the logs:
>
>
>
> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
> CassandraDaemon.java:228 - Exception in thread
> Thread[ReadRepairStage:2328,5,main]
>
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out
> - received only 1 responses.
>
> at org.apache.cassandra.service.DataResolver$
> RepairMergeListener.close(DataResolver.java:171)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at org.apache.cassandra.db.partitions.
> UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ~[na:1.8.0_91]
>
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ~[na:1.8.0_91]
>
> at org.apache.cassandra.concurrent.NamedThreadFactory.
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>
>
>
> But it's hard to correlate precisely with things going bad. It is also
> very strange to me since I have both read_repair_chance and
> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
> confusing why ReadRepairStage would err.
>
> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
> If I can't find a resolution I'm going to need to downgrade and restore to
> backup...
>
> The only issue I found that looked similar is https://issues.apache.org/
> jira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10.
>
>
>
> $ nodetool tpstats
>
> Pool Name Active   Pending  Completed
> Blocked  All time blocked
>
> ReadStage  0 0   

Re:

2017-09-28 Thread Dan Kinder
Sorry, regarding that ReadStage exception, I take it back; I accidentally ended
up too early in the logs. This node with the ReadStage buildup shows no
exceptions in the logs.

nodetool tpstats
Pool Name Active   Pending  Completed   Blocked
 All time blocked
ReadStage  8  1882  45881 0
0
MiscStage  0 0  0 0
0
CompactionExecutor 9 9   2551 0
0
MutationStage  0 0   35929880 0
0
GossipStage0 0  35793 0
0
RequestResponseStage   0 0 751285 0
0
ReadRepairStage0 0224 0
0
CounterMutationStage   0 0  0 0
0
MemtableFlushWriter0 0111 0
0
MemtablePostFlush  0 0239 0
0
ValidationExecutor 0 0  0 0
0
ViewMutationStage  0 0  0 0
0
CacheCleanupExecutor   0 0  0 0
0
PerDiskMemtableFlushWriter_10  0 0104 0
0
PerDiskMemtableFlushWriter_11  0 0104 0
0
MemtableReclaimMemory  0 0116 0
0
PendingRangeCalculator 0 0 16 0
0
SecondaryIndexManagement   0 0  0 0
0
HintsDispatcher0 0 13 0
0
PerDiskMemtableFlushWriter_1   0 0104 0
0
Native-Transport-Requests  0 02607030 0
0
PerDiskMemtableFlushWriter_2   0 0104 0
0
MigrationStage 0 0278 0
0
PerDiskMemtableFlushWriter_0   0 0115 0
0
Sampler0 0  0 0
0
PerDiskMemtableFlushWriter_5   0 0104 0
0
InternalResponseStage  0 0298 0
0
PerDiskMemtableFlushWriter_6   0 0104 0
0
PerDiskMemtableFlushWriter_3   0 0104 0
0
PerDiskMemtableFlushWriter_4   0 0104 0
0
PerDiskMemtableFlushWriter_9   0 0104 0
0
AntiEntropyStage   0 0  0 0
0
PerDiskMemtableFlushWriter_7   0 0104 0
0
PerDiskMemtableFlushWriter_8   0 0104 0
0

Message type   Dropped
READ 0
RANGE_SLICE  0
_TRACE   0
HINT 0
MUTATION 0
COUNTER_MUTATION 0
BATCH_STORE  0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE  0
READ_REPAIR  0


On Thu, Sep 28, 2017 at 2:08 PM, Dan Kinder  wrote:

> Thanks for the responses.
>
> @Prem yes this is after the entire cluster is on 3.11, but no I did not
> run upgradesstables yet.
>
> @Thomas no I don't see any major GC going on.
>
> @Jeff yeah it's fully upgraded. I decided to shut the whole thing down and
> bring it back (thankfully this cluster is not serving live traffic). The
> nodes seemed okay for an hour or two, but I see the issue again, without me
> bouncing any nodes. This time it's ReadStage that's building up, and the
> exception I'm seeing in the logs is:
>
> DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242
> - Digest mismatch:
>
> org.apache.cassandra.service.DigestMismatchException: Mismatch for key
> DecoratedKey(6150926370328526396, 696a6374652e6f7267) (
> 2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)
>
> at 
> org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at org.apache.cassandra.service.ReadCallback$
> AsyncRepairRunner.run(ReadCallback.java:233)
> ~[apache-cass

Re:

2017-09-28 Thread Jeff Jirsa
The digest mismatch exception is not a problem, that's why it's only logged
at debug.

As Thomas noted, there's a pretty good chance this is
https://issues.apache.org/jira/browse/CASSANDRA-13754 - if you see a lot of
GCInspector logs indicating GC pauses, that would add confidence to that
diagnosis.  


On Thu, Sep 28, 2017 at 2:08 PM, Dan Kinder  wrote:

> Thanks for the responses.
>
> @Prem yes this is after the entire cluster is on 3.11, but no I did not
> run upgradesstables yet.
>
> @Thomas no I don't see any major GC going on.
>
> @Jeff yeah it's fully upgraded. I decided to shut the whole thing down and
> bring it back (thankfully this cluster is not serving live traffic). The
> nodes seemed okay for an hour or two, but I see the issue again, without me
> bouncing any nodes. This time it's ReadStage that's building up, and the
> exception I'm seeing in the logs is:
>
> DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242
> - Digest mismatch:
>
> org.apache.cassandra.service.DigestMismatchException: Mismatch for key
> DecoratedKey(6150926370328526396, 696a6374652e6f7267) (
> 2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)
>
> at 
> org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at org.apache.cassandra.service.ReadCallback$
> AsyncRepairRunner.run(ReadCallback.java:233)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_71]
>
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_71]
>
> at org.apache.cassandra.concurrent.NamedThreadFactory.
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> [apache-cassandra-3.11.0.jar:3.11.0]
>
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71]
>
>
> Do you think running upgradesstables would help? Or relocatesstables? I
> presumed it shouldn't be necessary for Cassandra to function, just an
> optimization.
>
> On Thu, Sep 28, 2017 at 12:49 PM, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> Dan,
>>
>>
>>
>> do you see any major GC? We have been hit by the following memory leak in
>> our loadtest environment with 3.11.0.
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-13754
>>
>>
>>
>> So, depending on the heap size and uptime, you might get into heap
>> troubles.
>>
>>
>>
>> Thomas
>>
>>
>>
>> *From:* Dan Kinder [mailto:dkin...@turnitin.com]
>> *Sent:* Donnerstag, 28. September 2017 18:20
>> *To:* user@cassandra.apache.org
>> *Subject:*
>>
>>
>>
>> Hi,
>>
>> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
>> following. The cluster does function, for a while, but then some stages
>> begin to back up and the node does not recover and does not drain the
>> tasks, even under no load. This happens both to MutationStage and
>> GossipStage.
>>
>> I do see the following exception happen in the logs:
>>
>>
>>
>> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
>> CassandraDaemon.java:228 - Exception in thread
>> Thread[ReadRepairStage:2328,5,main]
>>
>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
>> out - received only 1 responses.
>>
>> at 
>> org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at org.apache.cassandra.db.partitions.UnfilteredPartitionIterat
>> ors$2.close(UnfilteredPartitionIterators.java:182)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThr
>> ow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> ~[na:1.8.0_91]
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> ~[na:1.8.0_91]
>>
>> at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$
>> threadLocalDeallocator$0(NamedThreadFactory.java:81)
>> ~[apache-cassandra-3.11.0.jar:3.11.0]
>>
>> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>>
>>
>>
>> But it's hard to correlate precisely with things going bad. It is also
>> very strange to me since I have both read_repair_chance and
>> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
>> confusing why ReadRepair

Re: data loss in different DC

2017-09-28 Thread Peng Xiao
Thanks All


-- Original Message --
From: "Jeff Jirsa";
Sent: Thursday, September 28, 2017, 9:16
To: "user";
Subject: Re: data loss in different DC



Your quorum writes are only guaranteed to be on half+1 replicas - there's no
guarantee which nodes those will be. For strong consistency with multiple DCs,
you can either:

- write at quorum and read at quorum from any dc, or
- write each_quorum and read local_quorum from any dc, or
- write at local_quorum and read local_quorum from the same DC only




-- 
Jeff Jirsa


> On Sep 28, 2017, at 2:41 AM, Peng Xiao <2535...@qq.com> wrote:
> 
> Dear All,
> 
> We have a cluster with one DC1:RF=3,another DC DC2:RF=1 only for ETL,but we 
> found that sometimes we can query records in DC1,while not able not find the 
> same record in DC2 with local_quorum.How it happens?
> Could anyone please advise?
> looks we can only run repair to fix it.
> 
> Thanks,
> Peng Xiao


quietness of full nodetool repair on large dataset

2017-09-28 Thread Mitch Gitman
I'm on Apache Cassandra 3.10. I'm interested in moving over to Reaper for
repairs, but in the meantime, I want to get nodetool repair working a
little more gracefully.

What I'm noticing is that, when I'm running a repair for the first time
with the --full option after a large initial load of data, the client will
say it's starting on a repair job and then cease to produce any output for
not just minutes but a few hours. This causes SSH inactivity timeouts. I
have tried running the repair with the --trace option, but then that leads
to the other extreme where there's just a torrent of output, scarcely any
of which I'll typically need.

As a literal solution to my SSH inactivity timeouts, I could extend the
timeouts, or I could do some scripting jujitsu with
StrictHostKeyChecking=no and a loop that spits some arbitrary output until
the command finishes. But even if the timeouts were no concern, the sheer
unresponsiveness is apt to make an operator nervous. And I'd like to think
there's a Goldilocks way to run a full nodetool repair on a large dataset
where it's just a bit more responsive without going all TMI. Thoughts?
Anyone else notice this?
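
(A minimal sketch of that kind of keepalive wrapper, in Python 3, assuming nodetool is on the PATH; the repair options are just an example.)

```python
import subprocess
import sys
import threading
import time

def heartbeat(proc, interval=60):
    """Print something periodically so the SSH session never looks idle."""
    while proc.poll() is None:
        print("nodetool repair still running...", flush=True)
        time.sleep(interval)

# Kick off the full repair and keep the terminal chatty until it finishes.
proc = subprocess.Popen(["nodetool", "repair", "--full"])
threading.Thread(target=heartbeat, args=(proc,), daemon=True).start()
sys.exit(proc.wait())
```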


Re: quietness of full nodetool repair on large dataset

2017-09-28 Thread Jeff Jirsa
Screen and/or subrange repair (e.g. reaper)

-- 
Jeff Jirsa


> On Sep 28, 2017, at 8:23 PM, Mitch Gitman  wrote:
> 
> I'm on Apache Cassandra 3.10. I'm interested in moving over to Reaper for 
> repairs, but in the meantime, I want to get nodetool repair working a little 
> more gracefully. 
> 
> What I'm noticing is that, when I'm running a repair for the first time with 
> the --full option after a large initial load of data, the client will say 
> it's starting on a repair job and then cease to produce any output for not 
> just minutes but a few hours. This causes SSH inactivity timeouts. I have 
> tried running the repair with the --trace option, but then that leads to the 
> other extreme where there's just a torrent of output, scarcely any of which 
> I'll typically need. 
> 
> As a literal solution to my SSH inactivity timeouts, I could extend the 
> timeouts, or I could do some scripting jujitsu with StrictHostKeyChecking=no 
> and a loop that spits some arbitrary output until the command finishes. But 
> even if the timeouts were no concern, the sheer unresponsiveness is apt to 
> make an operator nervous. And I'd like to think there's a Goldilocks way to 
> run a full nodetool repair on a large dataset where it's just a bit more 
> responsive without going all TMI. Thoughts? Anyone else notice this?




cassandra hardware requirements (SATA/SSD)

2017-09-28 Thread Peng Xiao
Hi there,
we are struggling with hardware selection. We all know that SSD is good, and
DataStax suggests we use SSDs, but since Cassandra is a CPU-bound DB, we are
considering SATA disks. We noticed that the normal IO throughput is 7 MB/s.


Could anyone give some advice?


Thanks,
Peng Xiao