When you say you can check 50 or 100 per second, how many rows
and columns does a check involve?  What query API are you using?

Your Cassandra nodes look mostly idle.  Is each client thread getting
the same amount of work, or are some finishing sooner than others?  Is
your client CPU or disk perhaps the bottleneck?
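
If it helps narrow that down: a cheap way to see where each thread's time
goes is to split its wall-clock time between "waiting on Cassandra" and
"local work" (reading the /tmp file, comparing values).  A minimal sketch,
with made-up names, not taken from the actual checker:

    #include <chrono>
    #include <cstdio>
    #include <functional>

    struct ThreadStats {
        double cassandra_secs;   // time spent inside the read call
        double local_secs;       // time spent parsing the input file / comparing
        long   checks;
    };

    // Run `work` and add its elapsed wall-clock time to `bucket`.
    static bool timed(double &bucket, const std::function<bool()> &work) {
        std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        bool ok = work();
        bucket += std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - t0).count();
        return ok;
    }

    int main() {
        ThreadStats s = {0, 0, 0};                         // one per checker thread
        for (int i = 0; i < 1000; ++i) {
            timed(s.cassandra_secs, [] { return true; });  // the real get() goes here
            timed(s.local_secs,     [] { return true; });  // the real read/compare goes here
            ++s.checks;
        }
        std::printf("%ld checks: %.2fs waiting on cassandra, %.2fs in the client\n",
                    s.checks, s.cassandra_secs, s.local_secs);
        return 0;
    }

If the local bucket dominates, the client box is the limit; if the Cassandra
bucket dominates and varies with the connection point, the nodes are.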

On Fri, Apr 2, 2010 at 2:39 PM, Mark Jones <mjo...@imagehawk.com> wrote:
> To further complicate matters,
>  when I read only from cassdb1, I can check about 100/second/thread (40 threads)
>  when I read only from cassdb2, I can check about 50/second/thread (40 threads)
>  when I read only from cassdb3, I can check about 50/second/thread (40 threads)
>
> This is with a consistency level of ONE, ALL, or QUORUM.  All 3 levels return
> about the same read rate (~5/second), yet 2 nodes return them at half the
> speed of the other node.
>
> I don't understand how this could be, since QUORUM or ALL would require 2 of
> the 3 nodes to respond in all cases, so you would expect the read rate to be
> 50/second/thread or 100/second/thread regardless of who does the proxy.
>
> -----Original Message-----
> From: Mark Jones [mailto:mjo...@imagehawk.com]
> Sent: Friday, April 02, 2010 1:38 PM
> To: user@cassandra.apache.org
> Subject: Slow Responses from 2 of 3 nodes in RC1
>
> I have a 3 node cassandra cluster I'm trying to work with:
>
> All three machines are about the same:
> 6-8GB per machine (fastest machine has 8GB, Java VM limited to 5GB)
> separate spindle for cassandra data and commit log
>
> I wrote ~7 million items to Cassandra and am now trying to read them back.
> The ones that are missing might be troubling, but I'm not worried about that
> yet.  Part of the reason I only have ~7 million items in is that 2 of the
> nodes are NOT pulling their weight:
>
>
> I've used "nodetool loadbalance" on them to get the data evened out; it was
> terribly imbalanced after ingestion, but it now looks like this:
>
> Address       Status     Load          Range                                       Ring
>                                        169214437894733073017295274330696200891
> 192.168.1.116 Up         1.88 GB       83372832363385696737577075791407985563     |<--|     (cassdb2)
> 192.168.1.119 Up         2.59 GB       167732545381904888270252256443838855184    |   |     (cassdb3)
> 192.168.1.12  Up         2.5 GB        169214437894733073017295274330696200891    |-->|     (cassdb1)
>
> This is a summary report from my checking program (C++).  It runs one thread
> per file (files contain the originally ingested data), checking to see if the
> data inserted is present and the same as when it was inserted.  Each thread
> has its own Thrift and Cassandra connection setup.  The connection point is
> randomly chosen at startup and that connection is reused by that thread until
> the end of the test.  All the threads run simultaneously and I would expect
> similar results, but one node is beating the pants off the other two for
> performance.
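>
> Roughly, each thread's read path looks like the sketch below.  This is a
> rough 0.6-style Thrift sketch, not the real code: the column name, host and
> expected-value check are placeholders (only the "bumble" keyspace and the
> "Contacts" column family come from the cluster).
>
>     // One checker thread: one connection, reused for every key it checks.
>     // Thrift include paths and shared_ptr flavour vary by Thrift version.
>     #include <string>
>     #include <boost/shared_ptr.hpp>
>     #include <protocol/TBinaryProtocol.h>
>     #include <transport/TSocket.h>
>     #include <transport/TBufferTransports.h>
>     #include "Cassandra.h"   // generated by `thrift --gen cpp cassandra.thrift`
>
>     using namespace apache::thrift::transport;
>     using namespace apache::thrift::protocol;
>     using namespace org::apache::cassandra;
>
>     // false => counted as Missing; a value mismatch => counted as Miscompared.
>     static bool check_one_key(CassandraClient &client, const std::string &key,
>                               const std::string &expected)
>     {
>         ColumnPath path;
>         path.column_family = "Contacts";
>         path.column = "data";            // placeholder column name
>         path.__isset.column = true;
>
>         ColumnOrSuperColumn result;
>         try {
>             // consistency level varied per run: ONE / QUORUM / ALL
>             client.get(result, "bumble", key, path, ConsistencyLevel::ONE);
>         } catch (const NotFoundException &) {
>             return false;
>         }
>         return result.column.value == expected;
>     }
>
>     int main()
>     {
>         // Connection point picked once at startup (placeholder host/port).
>         boost::shared_ptr<TSocket>    socket(new TSocket("192.168.1.12", 9160));
>         boost::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
>         boost::shared_ptr<TProtocol>  protocol(new TBinaryProtocol(transport));
>         CassandraClient client(protocol);
>         transport->open();
>
>         check_one_key(client, "some-key", "expected-value");
>
>         transport->close();
>         return 0;
>     }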
>
> In the logs there is nothing but INFO lines like these (there are others
> that give less information about performance); no exceptions or warnings:
> cassdb1:
> INFO [COMPACTION-POOL:1] 2010-04-02 08:20:35,339 CompactionManager.java (line 326) Compacted to /cassandra/data/bumble/Contacts-15-Data.db.  262279345/243198299 bytes for 324378 keys.  Time: 16488ms.
>
> cassdb2:
> INFO [COMPACTION-POOL:1] 2010-04-02 08:20:16,448 CompactionManager.java (line 326) Compacted to /cassandra/data/bumble/Contacts-5-Data.db.  251086153/234535924 bytes for 284088 keys.  Time: 22805ms.
>
> cassdb3:
> INFO [COMPACTION-POOL:1] 2010-04-02 08:20:24,429 CompactionManager.java (line 326) Compacted to /cassandra/data/bumble/Contacts-20-Data.db.  266451419/248084737 bytes for 347531 keys.  Time: 25094ms.
>
>
> How do I go about figuring out what is going on in this setup?
>
> The iostat -x output is at the bottom.
>
> cassdb1 Checked:    9773 Good:    9770 Missing:       3 Miscompared:        0  '/tmp/QUECD0000000005
> cassdb1 Checked:    9818 Good:    9817 Missing:       1 Miscompared:        0  '/tmp/QUEDE0000000005
> cassdb1 Checked:    9820 Good:    9820 Missing:       0 Miscompared:        0  '/tmp/QUEQ0000000005
> cassdb1 Checked:    9836 Good:    9836 Missing:       0 Miscompared:        0  '/tmp/QUEJ0000000005
> cassdb1 Checked:    9843 Good:    9843 Missing:       0 Miscompared:        0  '/tmp/QUEFG0000000005
> cassdb1 Checked:    9883 Good:    9883 Missing:       0 Miscompared:        0  '/tmp/QUENO0000000005
> cassdb1 Checked:    9884 Good:    9883 Missing:       1 Miscompared:        0  '/tmp/QUEIJ0000000005
> cassdb1 Checked:    9890 Good:    9890 Missing:       0 Miscompared:        0  '/tmp/QUER0000000005
> cassdb1 Checked:    9915 Good:    9913 Missing:       2 Miscompared:        0  '/tmp/QUEMN0000000005
> cassdb1 Checked:    9962 Good:    9962 Missing:       0 Miscompared:        0  '/tmp/QUEF0000000005
> cassdb1 Checked:   10120 Good:   10120 Missing:       0 Miscompared:        0  '/tmp/QUEH0000000005
> cassdb1 Checked:   10123 Good:   10123 Missing:       0 Miscompared:        0  '/tmp/QUEM0000000005
> cassdb1 Checked:   10280 Good:   10280 Missing:       0 Miscompared:        0  '/tmp/QUEP0000000005
> cassdb1 Checked:   10490 Good:   10490 Missing:       0 Miscompared:        0  '/tmp/QUEL0000000005
>
> cassdb2 Checked:       0 Good:       0 Missing:       0 Miscompared:        0  '/tmp/QUEC0000000005
> cassdb2 Checked:       1 Good:       1 Missing:       0 Miscompared:        0  '/tmp/QUEN0000000005
> cassdb2 Checked:       2 Good:       2 Missing:       0 Miscompared:        0  '/tmp/QUEEF0000000005
> cassdb2 Checked:       2 Good:       2 Missing:       0 Miscompared:        0  '/tmp/QUEG0000000005
> cassdb2 Checked:       3 Good:       3 Missing:       0 Miscompared:        0  '/tmp/QUEV0000000005
> cassdb2 Checked:       3 Good:       3 Missing:       0 Miscompared:        0  '/tmp/QUEX0000000005
> cassdb2 Checked:       4 Good:       4 Missing:       0 Miscompared:        0  '/tmp/QUEB0000000005
> cassdb2 Checked:       4 Good:       4 Missing:       0 Miscompared:        0  '/tmp/QUEBC0000000005
> cassdb2 Checked:       5 Good:       5 Missing:       0 Miscompared:        0  '/tmp/QUEAB0000000005
> cassdb2 Checked:       5 Good:       5 Missing:       0 Miscompared:        0  '/tmp/QUET0000000005
> cassdb2 Checked:       6 Good:       6 Missing:       0 Miscompared:        0  '/tmp/QUEJK0000000005
> cassdb2 Checked:       7 Good:       7 Missing:       0 Miscompared:        0  '/tmp/QUEO0000000005
> cassdb2 Checked:       9 Good:       9 Missing:       0 Miscompared:        0  '/tmp/QUED0000000005
> cassdb2 Checked:      10 Good:      10 Missing:       0 Miscompared:        0  '/tmp/QUEK0000000005
>
> cassdb3 Checked:      13 Good:      13 Missing:       0 Miscompared:        0  '/tmp/QUEHI0000000005
> cassdb3 Checked:      17 Good:      17 Missing:       0 Miscompared:        0  '/tmp/QUES0000000005
> cassdb3 Checked:      18 Good:      18 Missing:       0 Miscompared:        0  '/tmp/QUEI0000000005
> cassdb3 Checked:      19 Good:      19 Missing:       0 Miscompared:        0  '/tmp/QUEW0000000005
> cassdb3 Checked:      20 Good:      20 Missing:       0 Miscompared:        0  '/tmp/QUEE0000000005
> cassdb3 Checked:      20 Good:      20 Missing:       0 Miscompared:        0  '/tmp/QUEY0000000005
> cassdb3 Checked:      21 Good:      21 Missing:       0 Miscompared:        0  '/tmp/QUEA0000000005
> cassdb3 Checked:      21 Good:      21 Missing:       0 Miscompared:        0  '/tmp/QUELM0000000005
> cassdb3 Checked:      21 Good:      21 Missing:       0 Miscompared:        0  '/tmp/QUEU0000000005
> cassdb3 Checked:      23 Good:      23 Missing:       0 Miscompared:        0  '/tmp/QUEGH0000000005
> cassdb3 Checked:      23 Good:      23 Missing:       0 Miscompared:        0  '/tmp/QUEKL0000000005
> cassdb3 Checked:      23 Good:      23 Missing:       0 Miscompared:        0  '/tmp/QUEZ0000000005
>
> cassdb1:
> Linux 2.6.31-14-generic (record)        04/02/2010      _x86_64_        (2 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           6.81    0.09    2.23    0.36    0.00   90.51
>
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.27    25.61    0.89    0.77    27.80   210.90   143.64     0.05   27.35   4.67   0.78
> sdb               0.02     0.00    0.05    0.00     0.78     0.02    15.29     0.00   10.96  10.95   0.06
> sdc               0.02    35.43    0.01    0.75     0.15   718.84   947.70     0.28  371.59   4.36   0.33
>
> cassdb2:
> Linux 2.6.31-14-generic (ec2)   04/02/2010      _x86_64_        (2 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           9.34    0.01    1.50    3.85    0.00   85.31
>
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               2.51     0.47    3.35    0.69   176.95    11.07    46.50     0.32   79.83   5.22   2.11
> sdb               0.06     0.00    0.02    0.00     0.60     0.00    32.17     0.00    2.83   2.83   0.01
> sdc               0.21     0.00   23.43    0.01  1581.80     1.74    67.55     0.45   19.18   3.67   8.61
>
> cassdb3:
> Linux 2.6.31-14-generic (ec1)   04/02/2010      _x86_64_        (2 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>          12.57    0.12    1.91    0.35    0.00   85.06
>
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.45    32.56    0.83    0.89    31.44   267.46   173.90     0.04   20.99   4.45   0.77
> sdb               0.02    37.61    1.35    0.79    26.30   674.95   327.98     0.28  133.03   3.36   0.72
>
