Thanks, Nate, for the quick reply. We will test with different
concurrent_compactors settings. It would save others a lot of time if the
documentation could be fixed; we spent days arriving at this setting, and even
then only by chance.

As far as the data folder and IO are concerned: I confirmed that the data
folders are in the same place in both cases, and there are hardly any reads on
either cluster (see iostat below). Can you tell me what could trigger such high
read repair numbers in 2.1.11 compared to 2.0.9 (10 times more in 2.1.11)?
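
For reference, the counters we are comparing are the read repair meters exposed
over JMX; assuming the stock 2.1 metric names, those are:

    org.apache.cassandra.metrics:type=ReadRepair,name=RepairedBlocking
    org.apache.cassandra.metrics:type=ReadRepair,name=RepairedBackground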

Please find tpstats and iostat output for both 2.0.9 and 2.1.11 (the two iostat
captures used different invocations, so their columns differ):
Tpstats for 2.0.9
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        4352903         0                 0
ReadStage                         0         0       46282140         0                 0
RequestResponseStage              0         0       12779370         0                 0
ReadRepairStage                   0         0          18719         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
MiscStage                         0         0              0         0                 0
HintedHandoff                     0         0              5         0                 0
FlushWriter                       0         0          91885         0                10
MemoryMeter                       0         0          82032         0                 0
GossipStage                       0         0         457802         0                 0
CacheCleanupExecutor              0         0              0         0                 0
InternalResponseStage             0         0              6         0                 0
CompactionExecutor                0         0         993103         0                 0
ValidationExecutor                0         0              0         0                 0
MigrationStage                    0         0             28         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropyStage                  0         0              0         0                 0
PendingRangeCalculator            0         0              5         0                 0
MemtablePostFlusher               0         0          94496         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                     0
COUNTER_MUTATION             0
BINARY                       0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

Tpstats for 2.1.11
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        1113428         0                 0
ReadStage                         0         0       23496750         0                 0
RequestResponseStage              0         0       29951269         0                 0
ReadRepairStage                   0         0        3848733         0                 0
CounterMutationStage              0         0              0         0                 0
MiscStage                         0         0              0         0                 0
HintedHandoff                     0         0              4         0                 0
GossipStage                       0         0         182727         0                 0
CacheCleanupExecutor              0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
CommitLogArchiver                 0         0              0         0                 0
CompactionExecutor                0         0          89820         0                 0
ValidationExecutor                0         0              0         0                 0
MigrationStage                    0         0             10         0                 0
AntiEntropyStage                  0         0              0         0                 0
PendingRangeCalculator            0         0              6         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0          38222         0                 0
MemtablePostFlush                 0         0          39814         0                 0
MemtableReclaimMemory             0         0          38222         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                     0
COUNTER_MUTATION             0
BINARY                       0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

IOSTAT for 2.1.11
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.21    1.10    0.70    0.12    0.03   76.84

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.01     3.00    0.03    2.50     0.00     0.03    28.87     0.01    5.09   0.27   0.07
xvdb              0.00    17.05    0.03   17.52     0.00     0.49    57.06     0.03    1.79   0.41   0.71
xvdc              0.00    17.31    0.03   17.93     0.00     0.50    56.93     0.03    1.74   0.40   0.72
dm-0              0.00     0.00    0.07   56.41     0.00     0.99    35.82     0.11    2.01   0.23   1.27

IOSTAT for 2.0.9
Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read     Blk_wrtn
xvda              3.87         3.34       211.09    89522823   5655075464
xvdb              5.37         3.82       408.26   102210432  10937070024
xvdc              5.81         4.18       435.33   111917570  11662380112
dm-0             20.35         7.99       843.59   214122034  22599449976

From: Nate McCall <n...@thelastpickle.com>
Reply-To: Cassandra Users <user@cassandra.apache.org>
Date: Friday, January 29, 2016 at 3:01 PM
To: Cassandra Users <user@cassandra.apache.org>
Subject: Re: Slow performance after upgrading from 2.0.9 to 2.1.11

On Fri, Jan 29, 2016 at 12:30 PM, Peddi, Praveen <pe...@amazon.com> wrote:
>
> Hello,
> We have another update on performance on 2.1.11. compression_chunk_size
> didn't really help much, but we changed concurrent_compactors from the
> default to 64 in 2.1.11 and read latencies improved significantly. However,
> 2.1.11 read latencies are still 1.5x slower than 2.0.9. One thing we noticed
> in the JMX metrics that could affect read latencies is that 2.1.11 is running
> ReadRepairedBackground and ReadRepairedBlocking far more frequently than
> 2.0.9, even though our read_repair_chance is the same on both. Could anyone
> shed some light on why 2.1.11 could be running read repair 10 to 50 times
> more in spite of the same configuration on both clusters?
>
> dclocal_read_repair_chance=0.100000 AND
> read_repair_chance=0.000000 AND
>
> Here is the table of read repair metrics for both clusters:
>
>                                      2.0.9    2.1.11
> ReadRepairedBackground   5MinAvg     0.006    0.1
>                          15MinAvg    0.009    0.153
> ReadRepairedBlocking     5MinAvg     0.002    0.55
>                          15MinAvg    0.007    0.91

The concurrent_compactors setting is not a surprise. The default in 2.0 was the 
number of cores and in 2.1 is now:
"the smaller of (number of disks, number of cores), with a minimum of 2 and a 
maximum of 8"
https://github.com/apache/cassandra/blob/cassandra-2.1/conf/cassandra.yaml#L567-L568

So in your case this was "8" in 2.0 vs. "2" in 2.1 (assuming these are still
the stock-ish c3.2xl mentioned previously?). Regardless, 64 is way too high. Set
it back to 8.
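
In cassandra.yaml that would be something like:

    # 2.1 defaults this to the smaller of (number of disks, number of cores),
    # clamped between 2 and 8; setting it explicitly restores the old 2.0
    # behavior on an 8-core box.
    concurrent_compactors: 8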

Note: this got dropped from the "Upgrading" guide for 2.1 in
https://github.com/apache/cassandra/blob/cassandra-2.1/NEWS.txt though, so lots
of folks miss it.

Per said upgrading guide - are you sure the data directory is in the same place 
between the two versions and you are not pegging the wrong disk/partition? The 
default locations changed for data, cache and commitlog:
https://github.com/apache/cassandra/blob/cassandra-2.1/NEWS.txt#L171-L180
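
For example, to pin all three explicitly rather than relying on the new
defaults (paths below are only illustrative):

    data_file_directories:
        - /var/lib/cassandra/data
    commitlog_directory: /var/lib/cassandra/commitlog
    saved_caches_directory: /var/lib/cassandra/saved_caches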

I ask because being really busy on a single disk would cause latency and
potentially dropped messages, which could eventually cause a
DigestMismatchException requiring a blocking read repair.

Anything unusual in the node-level IO activity between the two clusters?

That said, the difference in nodetool tpstats output during and after the test
on both clusters could be insightful.
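
Something like this, left running on a node during the test, would do (a rough
sketch):

    # sample thread pool stats every 10 seconds, with a timestamp per sample
    while true; do date; nodetool tpstats; sleep 10; done >> tpstats.log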

When we do perf tests internally we usually use a combination of Grafana and 
Riemann to monitor Cassandra internals, the JVM and the OS. Otherwise, it's 
guess work.

--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
