UnavailableException with 3 nodes and RF=2

2010-09-14 Thread Chris Jansen
Hi All,

 

I'm a newbie to Cassandra, so I could have a configuration issue here; I
am using the latest stable release, 0.6.0.

 

I have created a cluster of 3 nodes, a keyspace with RF=2 and the
rack-unaware replication strategy. When I write with CL=QUORUM with all 3
nodes up, the data commits fine, but when I write with the same CL with one
of the nodes down I see an UnavailableException thrown. Surely if one of
the nodes in the cluster is down another should acknowledge the writes
and maintain the quorum, or is there something that I have
misunderstood? From what I understand, in this case with RF=2, for
quorum writes to succeed I need two nodes to acknowledge the write
(RF/2+1), which I have.

 

Here is how the cluster looks when quorum writes succeed:

 

192.168.245.2  Up    477.33 KB  78502309573904554351249603414557542595  |<--|
192.168.245.4  Up    426.74 KB  139625953069891725539207365034742863768 |   |
192.168.245.1  Up    496.67 KB  163572901304139170217093255272499595459 |-->|

 

This is how it looks with one node down and quorum writes fail (I am
writing to 192.168.245.1):

 

192.168.245.2  Down  423.58 KB  78502309573904554351249603414557542595  |<--|
192.168.245.4  Up    426.74 KB  139625953069891725539207365034742863768 |   |
192.168.245.1  Up    496.67 KB  163572901304139170217093255272499595459 |-->|

 

Here is the exception that is thrown:

 

Cannot write: 9e48b039-7687-4b14-9b40-0096b15fd7b0 RETRYING
UnavailableException()
    at org.apache.cassandra.thrift.Cassandra$insert_result.read(Cassandra.java:12303)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_insert(Cassandra.java:675)
    at org.apache.cassandra.thrift.Cassandra$Client.insert(Cassandra.java:648)
    at cassandraclient.Main.writeReadDelete(Main.java:101)
    at cassandraclient.Main.run(Main.java:188)
    at java.lang.Thread.run(Thread.java:619)

 

If I switch to CL=ONE the writes succeed, but I don't know if the data is
being replicated.

 

Any help would be greatly appreciated, thanks.

 

Chris Jansen




RE: UnavailableException with 3 nodes and RF=2

2010-09-14 Thread Dr . Martin Grabmüller
When you write with QUORUM, RF/2+1 of the nodes Cassandra *wants to write
to* have to be up. In your case, RF/2+1 = 2; that means the two nodes
responsible for the write have to be up, not any two nodes. Any write whose
two replicas include the node with token
78502309573904554351249603414557542595 will fail.

QUORUM consistency only gives you more availability when you have an RF of
3 or higher.
 
Martin
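
To make the arithmetic concrete (quorum = RF/2 + 1 with integer division,
so a quorum write survives RF - quorum down replicas):

    RF=2  ->  quorum 2  ->  0 replicas may be down
    RF=3  ->  quorum 2  ->  1 replica may be down
    RF=5  ->  quorum 3  ->  2 replicas may be down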






Re: UnavailableException with 3 nodes and RF=2

2010-09-14 Thread Sylvain Lebresne
On Tue, Sep 14, 2010 at 10:43 AM, Chris Jansen wrote:
> Hi All,
>
>
>
> I’m a newbie to Cassandra so I could have a configuration issue here, I am
> using the latest stable release 0.6.0.
>
>
>
> I have created a cluster of 3 nodes, a keyspace with RF=2 and a rack unaware
> replication strategy. When I write with CL=QUORUM with all 3 nodes commit
> the data fine, but when I write with the same CL with one of the nodes down
> I see an UnavailableException thrown. Surely if one of the nodes in the
> cluster is down another should acknowledge the writes and maintain the
> quorum, or is there something that I have misunderstood? From what I
> understand, in this case with a RF=2 for the quorum writes to succeed I need
> two nodes to acknowledge the write (RF/2+1), which I have.

RF=2 means that each row is replicated on 2 of your nodes. As you said,
quorum is then 2. This means that for a quorum operation to succeed, the 2
nodes out of the 2 that hold the row (*not* 2 out of all the nodes) must be
alive. To put it otherwise, if *any* of your nodes is dead, some operations
will fail with UnavailableException. That is, quorum tolerates a node being
down only starting at RF=3.

>
>
>
> Here is how the cluster looks when quorum writes succeed:
>
>
>
> 192.168.245.2 Up 477.33 KB
> 78502309573904554351249603414557542595 |<--|
>
> 192.168.245.4 Up 426.74 KB
> 139625953069891725539207365034742863768    |   |
>
> 192.168.245.1 Up 496.67 KB
> 163572901304139170217093255272499595459    |-->|
>
>
>
> This is how it looks with one node down and quorum writes fail (I am writing
> to 192.168.245.1):
>
>
>
> 192.168.245.2 Down   423.58 KB
>  78502309573904554351249603414557542595 |<--|
>
> 192.168.245.4 Up 426.74 KB
> 139625953069891725539207365034742863768    |   |
>
> 192.168.245.1 Up 496.67 KB
> 163572901304139170217093255272499595459    |-->|
>
>
>
> Here is the exception that is thrown:
>
>
>
> Cannot write: 9e48b039-7687-4b14-9b40-0096b15fd7b0 RETRYING
>
> UnavailableException()
>
>     at
> org.apache.cassandra.thrift.Cassandra$insert_result.read(Cassandra.java:12303)
>
>     at
> org.apache.cassandra.thrift.Cassandra$Client.recv_insert(Cassandra.java:675)
>
>     at
> org.apache.cassandra.thrift.Cassandra$Client.insert(Cassandra.java:648)
>
>     at cassandraclient.Main.writeReadDelete(Main.java:101)
>
>     at cassandraclient.Main.run(Main.java:188)
>
>     at java.lang.Thread.run(Thread.java:619)
>
>
>
> If I switch CL=ONE the writes succeed, but I don’t know if the data is being
> replicated.

Whatever consistency level you use for a write, the data is always
replicated unless some error occurs. The difference is whether the write
waits to see if an error occurs or not.

--
Sylvain



column limit on multiget_slice or get_slice

2010-09-14 Thread Courtney Robinson
Is it possible to get the first x columns from a row without knowing the
column names?
So far I've been working with just grabbing all the columns in a row or
getting a specific column that I know the name of.
If it is possible, can anyone point me in the right direction of how to do
this?
I'm using 0.6.4 with the Thrift interface in Java; I use Hector, but I'd
much prefer knowing how it's done via Thrift first :)
Thanks



Re: column limit on multiget_slice or get_slice

2010-09-14 Thread Chen Xinli
You can use get_slice:

    public List<ColumnOrSuperColumn> get_slice(String keyspace, String key,
            ColumnParent column_parent, SlicePredicate predicate,
            ConsistencyLevel consistency_level)
        throws InvalidRequestException, UnavailableException,
               TimedOutException, TException;

In the SlicePredicate's SliceRange, set start and finish to empty and count to x.
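
A minimal sketch of that call against the 0.6 Thrift API, assuming an
already-opened Cassandra.Client named client; the keyspace, row key, and
column family names are placeholders:

    // Ask for the first 10 columns of the row, in comparator order,
    // by leaving the slice's start and finish empty.
    SliceRange range = new SliceRange(new byte[0], new byte[0], false, 10);
    SlicePredicate predicate = new SlicePredicate();
    predicate.setSlice_range(range);

    List<ColumnOrSuperColumn> first10 = client.get_slice(
            "Keyspace1", "some-row-key",      // placeholder keyspace and key
            new ColumnParent("Standard1"),    // placeholder column family
            predicate, ConsistencyLevel.ONE);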

2010/9/14 Courtney Robinson 

> Is it possible to get the first x columns from a row without knowing the
> column names?
> So far i've been working with just grabbing all the columns in a row or
> just getting a specific column that i know the name of.
> If it is possible, can anyone point me in the right direction of how to do
> this?
> I'm using 0.6.4 with the thrift interface in java, i use hector but i'd
> much prefer knowing how its done via thrift first :)
> thanks
>



-- 
Best Regards,
Chen Xinli


Re: column limit on multiget_slice or get_slice

2010-09-14 Thread Courtney Robinson
Ahhh, excellent.
thank you




RE: UnavailableException with 3 nodes and RF=2

2010-09-14 Thread Chris Jansen
Thank you Martin, this has cleared things up for me. I had thought that a
replica would always be stored on the node I was connecting to; this also
explains why the load on each node is evenly balanced.

 

So I could sustain quorum with two node failures if I have an RF of 5 or greater.

 

Thanks again.

 

Chris

 



Minor question on index design

2010-09-14 Thread Janne Jalkanen
Hi all!

I'm pondering between a couple of alternatives here: I've got two CFs, one 
which contains Objects, and one which contains Users. Now, each Object has an 
owner associated to it, so obviously I need some sort of an index to point from 
Users to Objects.  This would be of course the perfect usecase for secondary 
indices on 0.7, but I'm still on 0.6.x.

So, esteemed Cassandra-heads, I'm pondering what would be a better design here:

1) I can create a separate CF "OwnerIdx" which has user id's as keys, and then 
each of the columns points at an object (with a dummy value, since I just need 
a list).  This would add a new CF, but on the other hand, this would be easy to 
drop once 0.7 comes along and I can just make a index query to the Objects CF, 
OR

2) Put the index inside the Users CF, with "object:" for column name and a
dummy value, and then get slices as necessary? This would mean fewer CFs (and
hence no schema modification), but might mean that I have to clean it up at
some point.

I don't yet have a lot of CFs, so I'm not worried about mem consumption really. 
 The Users CF is very read-heavy as-is, but the index and Objects will be a bit 
more balanced.

Experiences? Recommendations? Tips? Other possibilities? What other 
considerations should I take into account?

/Janne

Removing Data

2010-09-14 Thread Jeremiah Jordan
Is setting a value to '' the same as deleting it in terms of disk space
being freed? Will it still take gc_grace_seconds for the old data to
be removed from disk?

-Jeremiah

Jeremiah Jordan
Application Developer
Morningstar, Inc.




Re: Removing Data

2010-09-14 Thread Jonathan Ellis
On Tue, Sep 14, 2010 at 9:51 AM, Jeremiah Jordan wrote:
> Is setting a value to ‘’ the same as deleting it in terms of disk space
> being free’d?

no, you're saying "preserve that column X has an empty value, forever."

>  Will it still take gc_grace_seconds for the old data to be
> removed from disk?

No.  (It will take until it happens to be obsoleted by compaction.)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
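
For reference, a sketch of what an actual delete looks like against the 0.6
Thrift API; the keyspace, column family, row key, and column name here are
placeholders:

    // remove() writes a tombstone: the column stops being returned
    // immediately, and its disk space is reclaimed by compaction once
    // GCGraceSeconds has passed.
    ColumnPath path = new ColumnPath("Standard1");     // placeholder CF
    path.setColumn("X".getBytes("UTF-8"));             // the column to delete
    client.remove("Keyspace1", "some-row-key", path,
            System.currentTimeMillis() * 1000,         // microsecond timestamp
            ConsistencyLevel.QUORUM);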


Bootstrapping stays stuck

2010-09-14 Thread Gurpreet Singh
Hi,
I have a cassandra cluster of 4 machines, and I am trying to bootstrap 2
more machines, one at a time.
For both these machines, the bootstrapping stays stuck after the streaming
is done.

When the nodes come up for bootstrapping, I see all the relevant messages
about getting a new token, assuming load from a particular host. I see a
couple of nodes anticompacting data to send, and at a later point the node
that is bootstrapping prints the right streaming mesgs. However, once the
streaming is over, the node just doesnt do anything. Previously while
bootstrapping, I have seen that after the streaming is done, the node
restarts and becomes part of the ring by itself. I dont see this behaviour
with both the nodes I tried today.
I even restarted all the nodes in the cluster, and tried bootstrapping one
of the nodes again, but it again was stuck after streaming. It seems to have
copied the relevant load as well.
Any ideas as to what could be going on here?

/G


Re: Bootstrapping stays stuck

2010-09-14 Thread Gurpreet Singh
I am using cassandra 0.6.5.



Re: Couple of cache related questions

2010-09-14 Thread kannan chandrasekaran
Thanks a lot Jonathan !!!

Kannan





From: Jonathan Ellis 
To: user@cassandra.apache.org
Sent: Mon, September 13, 2010 4:47:05 PM
Subject: Re: Couple of cache related questions

On Sun, Sep 12, 2010 at 6:10 PM, kannan chandrasekaran
 wrote:
>> 1) What determines the amount of memory used per schema ignoring the
>> general
>> overhead to get cassandra up and running?  Is it just the size of the
>> caches
>> for the column Family + the memtable size ?
>
> and the bloom filter and index samples from the sstable files.
>
> Does that mean that cassandra tries to load the index and filter tables in
> memory as well, for each sstable in the keyspace?

it means it loads the bloom filter file, and a sample from the index file.

> Once the final memtable is flushed to the disk ( assuming no more writes) ,
> does read path also incur the memory size of the memtable for that
> particular CF ?

no.

> Does cassandra try to preallocate memory after startup for each schema even
> if its not used ( not being currently written to or read from)  ?

no.

> If I understand you correctly then I need to make sure that
>  the sum of sizes of all items in the cache across all the keyspaces +
> memtable + bloom filter + index samples  < Heap space

yes.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com



  

Re: Bootstrapping stays stuck

2010-09-14 Thread Gurpreet Singh
I tried this again, and it happened yet again.
This time, while the transfer messages seemed to be in order, I also noticed
that the source logs talk about having 9 dropped messages in the last 1000
ms. The only activity on the whole cluster is this bootstrapping; there is
no read/write traffic going on.
/G



Re: Bootstrapping stays stuck

2010-09-14 Thread vineet daniel
Hi Gurpreet,

What is the output of nodetool -h <host> streams (to see what is going on
between the nodes)? If you don't see anything happening, try switching off
the firewall or iptables.


Regards
Vineet Daniel






Re: Bootstrapping stays stuck

2010-09-14 Thread Gurpreet Singh
Hi Vineet,
I have tracked the nodetool streams to completion each time. Below are the
logs on the source and destination nodes. There are 3 sstables being
transferred, and the transfer seems to be successful. However, after the
streams finish, the source prints out messages about dropped messages,
which may point to the problem. Ideas? I checked that port 7000 is open for
communication. 9160 is not up on the node being bootstrapped, but that comes
up after the node is bootstrapped; is that right?

Thanks a ton,
/G

Logs on the source node (IP2):
INFO [STREAM-STAGE:1] 2010-09-14 09:54:07,900 StreamOut.java (line 79)
Flushing memtables for userdata...
 INFO [STREAM-STAGE:1] 2010-09-14 09:54:07,900 StreamOut.java (line 95)
Performing anticompaction ...
 INFO [COMPACTION-POOL:1] 2010-09-14 09:54:07,900 CompactionManager.java
(line 339) AntiCompacting
[org.apache.cassandra.io.SSTableReader(path='/data/cassandra/datadir/cassandradb/userdata/user_list_items-5823-Data.db')]
 INFO [GC inspection] 2010-09-14 09:56:54,712 GCInspector.java (line 129) GC
for ParNew: 212 ms, 29033016 reclaimed leaving 579419360 used; max is
4415946752
 INFO [COMPACTION-POOL:1] 2010-09-14 10:18:06,508 CompactionManager.java
(line 396) AntiCompacted to
/data/cassandra/datadir/cassandradb/userdata/stream/user_list_items-5825-Data.db.
 49074138589/36770836242 bytes for 5990912 keys.  Time: 1438607ms.
 INFO [COMPACTION-POOL:1] 2010-09-14 10:18:06,528 CompactionManager.java
(line 339) AntiCompacting
[org.apache.cassandra.io.SSTableReader(path='/data/cassandra/datadir/cassandradb/userdata/user-22-Data.db')]
 INFO [COMPACTION-POOL:1] 2010-09-14 10:18:08,839 CompactionManager.java
(line 396) AntiCompacted to
/data/mysql/cassandrastorage/userdata/stream/user-24-Data.db.
 28185244/21126422 bytes for 47722 keys.  Time: 2310ms.
 INFO [COMPACTION-POOL:1] 2010-09-14 10:18:08,840 CompactionManager.java
(line 339) AntiCompacting
[org.apache.cassandra.io.SSTableReader(path='/data/cassandra/datadir/cassandradb/userdata/user_lists-502-Data.db')]
 INFO [COMPACTION-POOL:1] 2010-09-14 10:21:08,606 CompactionManager.java
(line 396) AntiCompacted to
/data/mysql/cassandrastorage/userdata/stream/user_lists-504-Data.db.
 2927724285/2195768325 bytes for 3976118 keys.  Time: 179766ms.
 INFO [STREAM-STAGE:1] 2010-09-14 10:21:08,607 StreamOut.java (line 127)
Stream context metadata
/data/cassandra/datadir/cassandradb/userdata/stream/user_list_items-5825-Index.db:522051369,
 3
sstables./data/cassandra/datadir/cassandradb/userdata/stream/user_list_items-5825-Filter.db:7489045,
 3
sstables./data/cassandra/datadir/cassandradb/userdata/stream/user_list_items-5825-Data.db:36770836242,
 3
sstables./data/mysql/cassandrastorage/userdata/stream/user-24-Index.db:3373143,
 3
sstables./data/mysql/cassandrastorage/userdata/stream/user-24-Filter.db:59965,
 3
sstables./data/mysql/cassandrastorage/userdata/stream/user-24-Data.db:21126422,
 3
sstables./data/mysql/cassandrastorage/userdata/stream/user_lists-504-Index.db:282956452,
 3
sstables./data/mysql/cassandrastorage/userdata/stream/user_lists-504-Filter.db:4970125,
 3
sstables./data/mysql/cassandrastorage/userdata/stream/user_lists-504-Data.db:2195768325
 INFO [STREAM-STAGE:1] 2010-09-14 10:21:08,608 StreamOut.java (line 132)
Sending a stream initiate message to IP1...
 INFO [STREAM-STAGE:1] 2010-09-14 10:21:08,608 StreamOut.java (line 137)
Waiting for transfer to IP1 to complete
 WARN [DroppedMessagesLogger] 2010-09-14 10:28:00,592 MessagingService.java
(line 501) Dropped 9 messages in the last 1000ms
 INFO [STREAM-STAGE:1] 2010-09-14 10:28:00,605 StreamOut.java (line 141)
Done with transfer to IP1
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 10:28:00,670
SSTableDeletingReference.java (line 104) Deleted
/data/cassandra/datadir/cassandradb/system/LocationInfo-17-Data.db
 WARN [DroppedMessagesLogger] 2010-09-14 10:28:01,602 MessagingService.java
(line 501) Dropped 1 messages in the last 1000ms
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 10:28:06,133
SSTableDeletingReference.java (line 104) Deleted
/data/mysql/cassandrastorage/system/LocationInfo-19-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 10:28:06,134
SSTableDeletingReference.java (line 104) Deleted
/data/cassandra/datadir/cassandradb/system/LocationInfo-18-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 10:28:06,134
SSTableDeletingReference.java (line 104) Deleted
/data/cassandra/datadir/cassandradb/system/LocationInfo-20-Data.db



Logs on new node being bootstrapped (IP1):

INFO [main] 2010-09-14 09:53:37,788 BootStrapper.java (line 104) New token
will be 149298847080838649048722691811739653739 to assume load from IP2
 INFO [main] 2010-09-14 09:53:37,789 StorageService.java (line 388) Joining:
sleeping 3 ms for pending range setup
 INFO [main] 2010-09-14 09:54:07,792 StorageService.java (line 388)
Bootstrapping
 INFO [Thread-17] 2010-09-14 10:26:40,699 SSTableReader.java (line 120)
Sampling index for
/data/mysql/cassandrastorage/userdata/user_list_items-2-Data

Re: RE: UnavailableException with 3 nodes and RF=2

2010-09-14 Thread Aaron Morton
For background, have a read of http://wiki.apache.org/cassandra/HintedHandoff

As the doc (the one above, and Martin :) ) says, CL ONE, QUORUM and ALL only
count writes to nodes that are responsible for the key. Then HH is used to
eventually deliver that write to any nodes that were not available. CL.ANY
is a lot less consistent and will ack when only a HH is recorded.

AFAIK you are right that upping the RF to 5 will mean you can lose two nodes
*responsible for the key* and still run a QUORUM write.

Aaron

Re: Minor question on index design

2010-09-14 Thread Aaron Morton
I've been doing option 1 under 0.6. As usual in Cassandra, though, a lot
depends on how you access the data.

- If you often want to get the user and all of the objects they have, use
option 2. It's easier to have one read from one CF to answer your query.

- If the user has potentially >10k objects, go with option 2. AFAIK large
super columns are still inefficient:
https://issues.apache.org/jira/browse/CASSANDRA-674
https://issues.apache.org/jira/browse/CASSANDRA-598

- In your OwnerIndex CF, consider making the column name something
meaningful such as the object name or timestamp (if it has one) so you can
slice against it, e.g. to support paging operations. Make the column value
the key for the object.

Aaron
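
A sketch of the option-1 index write against the 0.6 Thrift API; the
keyspace and CF names are placeholders, and userId/objectKey stand in for
your own values:

    // Index row: key = user id, column name = object key, value = dummy.
    ColumnPath path = new ColumnPath("OwnerIdx");    // placeholder CF name
    path.setColumn(objectKey);                       // byte[] object key
    client.insert("Keyspace1", userId, path,
            new byte[0],                             // dummy value
            System.currentTimeMillis() * 1000,       // microsecond timestamp
            ConsistencyLevel.QUORUM);

Listing a user's objects is then a get_slice over that one row, as in the
get_slice thread earlier in this digest.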

jconsole uname/password

2010-09-14 Thread adam
Hi,

I'm trying to use Jconsole to tune our instance.

jconsole is connecting to the JMX port, as verified by netstat on both
machines, but I get the following error:

The connection to <host>:<port> did not succeed.
Would you like to try again?

Could this be because the user/password are unset or incorrect? What are the
defaults for Cassandra? Where are they specified?

Or is this indicative of another problem?

thanks,
Adam
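
For reference, remote JMX access is controlled by standard JVM system
properties set in Cassandra's startup script; as far as I recall, 0.6 ships
with authentication off, so no username/password should be needed. The
flags look like this (the port value is illustrative):

    -Dcom.sun.management.jmxremote.port=8080
    -Dcom.sun.management.jmxremote.ssl=false
    -Dcom.sun.management.jmxremote.authenticate=false

If authenticate is set to true, credentials come from the JDK's
jmxremote.password/jmxremote.access files, not from Cassandra itself. A
connection failure with authentication off usually points at a firewall or
at RMI advertising an unreachable hostname.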


Cassandra performance

2010-09-14 Thread Kamil Gorlo
Hey,

we are considering using Cassandra for quite a large project, and because
of that I made some tests with Cassandra. I was testing performance
and stability mainly.

My main tool was stress.py for benchmarks (or an equivalent written in
C++ to deal with Python 2.5's lack of multiprocessing). I will focus only
on reads (random with normal distribution, which is the default in
stress.py) because writes were /quite/ good.

I have 8 machines (Xen guests with a dedicated pair of 2TB SATA disks
combined in RAID-0 for every guest). Every machine has 4 individual
cores of 2.4 GHz and 4GB RAM.

Cassandra's commitlog and data dirs were on the same disk, I gave 2.5GB
of heap to Cassandra, and the key and row caches were disabled (standard
Keyspace1 schema; all tests use the Standard1 CF). All other options were
defaults. I disabled the caches because I was testing random (or
semi-random, normal-distribution) reads, so they wouldn't help much (and
also because 4GB of RAM is not a lot).

For first test I installed Cassandra on only one machine to test it
and remember results for further comparisons with large cluster and
other DBs.

1) RF was set to 1. I inserted ~20GB of data (this is the number
reported in the load column of nodetool ring output) using stress.py
(100 columns per row). Then I tested reads and got 200 rows/second
(reading 100 columns per row, CL=ONE; disks were the bottleneck, util was
100%). There was no other operation pending during reads (compaction,
insertion, etc.).

2) So I moved to a bigger cluster, with 8 machines and RF set to 2. I
inserted about ~20GB of data per node (so 20 GB * 8 / 2 = 80GB of "real
data"). Then I tested reads, exactly the same way as before, and got
about 450 rows/second (reading 100 columns, though reading only 1 in fact
makes no difference; CL=ONE, disks on every machine were at 100% util
because of random reads).

3) Then I changed RF from 2 to 3 on the cluster described in 2). So I
ended up with every node loaded with about 30GB of data. Then, as usual,
I tested reads, and got only 300 rows/second from the whole cluster
(100% util on every disk).

4) The last test was with RF=3 as before, but I inserted even more
data, so every node in the 8-machine cluster had ~100GB of data (8 *
100GB / 3 = 266GB of real data). In this case I got only 125
rows/second.

I was using multiple processes and machines to test reads.


*So my question is why these numbers are so low? What is especially
suprising for me is that changing RF from 2 to 3 drops performance
from 450 to 300 reads per second. Is this because of read repair?*


PS. To compare Cassandra's performance with other DBs, I also tested
MySQL with almost exactly the same data (one table with two columns, key
(int PK) and value (VARCHAR(500)), simulating 100 Cassandra columns per
row). MySQL was installed on the same machine as the Cassandra from
test 1) (which is one of the 8 machines described before). I
inserted some data and then tested random reads (which was even worse
for caching because I used the standard rand() from C++ to generate
keys, not a normal distribution). Here are the results:

size of data in db -> reads per second
21 GB  -> 340
400 GB -> 200

So I've got more reads from single MySQL with 400GB of data than from
8 machines storing about 266GB. This doesn't look good. What am I
doing wrong? :)

Cheers,
Kamil


Memtable adjusting impact expectations?

2010-09-14 Thread Dathan Pattishall
Okay, from what I gather, when data is written it is always written to
memory. The flow, for our concerns, is that the data is written to the
commit log and then to the memtable.

If any of the memtable's 3 tunable thresholds is hit, a flush occurs,
writing the data sorted by key to an SSTable, still enabling sequential
disk access. The fastest disk access is sequential while random is the
slowest; this is just a statement to make sure people are on the same page.
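
For reference, those three thresholds live in storage-conf.xml in 0.6; the
element names below are as I recall them for 0.6.x, and the values are only
illustrative defaults, so check the comments in your own config:

    <MemtableThroughputInMB>64</MemtableThroughputInMB>
    <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
    <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>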

Now, the data in an SSTable is immutable, and when the SSTable threshold is
hit (4 by default) the multiple SSTables are merged into a single SSTable;
this is compaction. I can't verify this because I never see more than 1
SSTable at a time, which could just be me misinterpreting the info. This
operation alone should produce the most load.

Using JConsole connected to a live Cassandra box's JMX port, and forcing
operations from the MBeans tab, I've quantified that normal operation for my
environment is 3% CPU utilization.
Each node does roughly 200 operations per second on standard key/value
pairs sorted by UTF8 with a random partitioner.

Doing a Flush for this 10MB Memtable consumes 33% of my CPU from Java's
point of view and that 33% is sustained for 15 mins - I assume some sorting
needs to be done according to this link:
http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent.FLUSH-WRITER-POOL

Doing a compaction consumes 30% of my CPU and lasts for 10 mins from the
subsequent flush.

Now, to give some context, my MemtableDataSize was around 11,000,000. I am
going to assume that this is in units of bytes, so roughly 10MB of data. Is
assuming that MemtableDataSize is in units of bytes correct? If so, why does
it take 15 mins to flush 10MB of data to a RAID-10 array of eight 15K RPM
disks with 256 MB of memory and BBC on a dedicated hardware controller?

Additional to this I see in system.log

WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-26 08:27:11,541
MessageDeserializationTask.java (line 47) dropping message (9,713ms past
timeout)

39,625 events are showing. How do I get rid of this? If it is a warning and
messages are being dropped, how can I fix the warning?



So, in summary. On Cassandra 0.6.4
with
TotalDiskSpaceUsed - 40GB
MemtableDataSize - 10MB
MemtableColumnsCount 130K

How do I remove MessageDeserializationTasks warnings?
How do I reduce the compaction time?
Why is flushing taking so long?
What do I look at to tune memtable thresholds? This page,
http://wiki.apache.org/cassandra/MemtableThresholds, does not help, and I
believe that I'm not in a position where I need to tune the defaults.

http://www.riptano.com/blog/cassandra-annotated-changelog-063 allows me to
lower the priority of compaction, but how do I tune for the best compaction
times versus system load?



Below is some sample output of system.log

Notice that SSTable 1155 is 320 MB; is the sorting taking all the time?
Because 320MB, even written randomly, should be very fast: 320 MB written
sequentially takes 2 seconds, while randomly I would expect 9 seconds (320
MB / 34 MB/sec random writes == 9 seconds).


 INFO [COMPACTION-POOL:1] 2010-09-14 20:40:39,120 CompactionManager.java
(line 320) Compacted to
/data/cassandra/data/TimeFrameClicks/Standard2-1155-Data.db.
339681221/339679627 bytes for 1253279 keys.  Time: 18645ms.
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 21:14:49,518
SSTableDeletingReference.java (line 104) Deleted
/data/cassandra/data/TimeFrameClicks/Standard2-1154-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 21:14:49,547
SSTableDeletingReference.java (line 104) Deleted
/data/cassandra/data/TimeFrameClicks/Standard2-1152-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 21:14:49,574
SSTableDeletingReference.java (line 104) Deleted
/data/cassandra/data/TimeFrameClicks/Standard2-1153-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 21:14:49,603
SSTableDeletingReference.java (line 104) Deleted
/data/cassandra/data/TimeFrameClicks/Standard2-1151-Data.db
 INFO [ROW-MUTATION-STAGE:6] 2010-09-14 22:24:33,320 ColumnFamilyStore.java
(line 357) Standard2 has reached its threshold; switching in a fresh
Memtable at
CommitLogContext(file='/data/cassandra/commitlog/CommitLog-1284496449043.log',
position=68512986)
 INFO [ROW-MUTATION-STAGE:6] 2010-09-14 22:24:33,320 ColumnFamilyStore.java
(line 609) Enqueuing flush of memtable-standa...@798596883(11324628 bytes,
314573 operations)
 INFO [FLUSH-WRITER-POOL:1] 2010-09-14 22:24:33,321 Memtable.java (line 148)
Writing memtable-standa...@798596883(11324628 bytes, 314573 operations)
 INFO [FLUSH-WRITER-POOL:1] 2010-09-14 22:24:38,974 Memtable.java (line 162)
Completed flushing
/data/cassandra/data/TimeFrameClicks/Standard2-1156-Data.db
 INFO [COMPACTION-POOL:1] 2010-09-14 22:25:19,326 CompactionManager.java
(line 246) Compacting
[org.apache.cassandra.io.SSTableReader(path='/data/cassandra/data/TimeFrameClicks/Standard2-1156-Data.db'),org.apache.cassandra.io.SSTableReader(path='/data/cassandra/data/TimeFrameClicks/Standard2-1129-Data.db'),org

Re: Cassandra performance

2010-09-14 Thread Chen Xinli
2010/9/15 Kamil Gorlo:

> Hey,
>
> we are considering using Cassandra for quite large project and because
> of that I made some tests with Cassandra. I was testing performance
> and stability mainly.
>
> My main tool was stress.py for benchmarks (or equivalent written in
> C++ to deal with python2.5 lack of multiprocessing). I will focus only
> on reads (random with normal distribution, what is default in
> stress.py) because writes were /quite/ good.
>
> I have 8 machines (xen quests with dedicated pair of 2TB SATA disks
> combined in RAID-O for every guest). Every machine has 4 individual
> cores of 2.4 Ghz and 4GB RAM.
>
> Cassandra commitlog and data dirs were on the same disk, I gave 2.5GB
> for Heap for Cassandra, key and row cached were disabled (standard
> Keyspace1 schema, all tests use Standard1 CF). All other options were
> defaults. I've disabled cache because I was testing random (or semi
> random - normal distribution) reads so it wouldnt help so much (and
> also because 4GB of RAM is not a lot).
>
> For first test I installed Cassandra on only one machine to test it
> and remember results for further comparisons with large cluster and
> other DBs.
>
> 1) RF was set to 1. I've inserted ~20GB of data (this is number
> reported in load column form nodetool ring output) using stress.py
> (100 colums per row). Then I've tested reads and got 200 rows/second
> (reading 100 columns per row, CL=ONE, disks were bottleneck, util was
> 100%). There was no other operation pending during reads (compaction,
> insertion, etc..).
>
> 2) So I moved to bigger cluster, with 8 machines and RF set to 2. I've
> inserted about ~20GB data per node (so 20 GB * 8 / 2 = 80GB of "real
> data"). Then I've tested reads, exactly te same way as before, and got
> about 450 rows/second (reading 100 columns (but reading only 1 in fact
> makes no difference), CL=ONE, disks on every machine was 100% util
> because of random reads).
>
> 3) Then I changed RF from 2 to 3 on cluster described in 2). So I
> ended with every node loaded with about 30GB of data. Then as usual,
> I've tested reads, and got only 300 rows/second from whole cluster
> (100% util on every disk).
>
> 4) Last test was with RF=3 as before, but I've inserted even more
> data, so every node on 8-machines cluster had ~100GB of data (8 *
> 100GB / 3 = 266GB of real data). In this case I've got only 125
> rows/second.
>
> I was using multiple processes and machines to test reads.
>
>
> *So my question is why these numbers are so low? What is especially
> suprising for me is that changing RF from 2 to 3 drops performance
> from 450 to 300 reads per second. Is this because of read repair?*
>

Yes.
Even for CL=ONE reads, the request is forwarded to all replicas for
read repair.
As disk access is your bottleneck, it sounds reasonable that the totals
match: 450 x 2 = 300 x 3 = 900 replica reads per second.

>
>
> PS. To compare Cassandra performance with other DBs, I've also tested
> MySQL with almost exact data (one table with two columns, key (int PK)
> and value(VARCHAR(500))  simulating 100 columns in Cassandra for
> single row). MySQL was installed on the same machine as Cassandra from
> test 1) (which is one of these 8 machines described before). I've
> inserted some data and then tested random reads (which was even worse
> for caching because I've used standard rand() from C++ to generate
> keys, not normal distribution). Here are results:
>
> size of data in db -> reads per second
> 21 GB  -> 340
> 400 GB -> 200
>
> So I've got more reads from single MySQL with 400GB of data than from
> 8 machines storing about 266GB. This doesn't look good. What am I
> doing wrong? :)
>
Disabling the row cache is OK, but the key cache should be enabled. It uses
little memory, but read performance will improve a lot.


> Cheers,
> Kamil



-- 
Best Regards,
Chen Xinli


Re: Cassandra performance

2010-09-14 Thread Jonathan Ellis
The key is that while Cassandra may read less rows per second than
MySQL when you are i/o bound (as you are here) because of SSTable
merging (see http://wiki.apache.org/cassandra/MemtableSSTable), you
should be using your Cassandra rows as materialized views so that each
query is a single row lookup rather than many.
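
To make "rows as materialized views" concrete, a hypothetical layout (the CF
and key names are invented for illustration): instead of assembling a result
from many lookups at read time, write the finished answer at insert time:

    UserFeed (Standard CF)
        row key:      user id
        column name:  time-ordered id (e.g. a TimeUUID)
        column value: the item, serialized

A page of a user's feed is then a single get_slice on one row, which is the
access pattern Cassandra serves fastest.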




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Memtable adjusting impact expectations?

2010-09-14 Thread Jonathan Ellis
If your question is, "should I optimize for faster compaction or less
impact on my cluster," the answer is almost always "less impact."

On Tue, Sep 14, 2010 at 8:33 PM, Dathan Pattishall  wrote:
> Okay from what I gather. When data is written its always written to memory.
> The flow for our concerns is the data is written to the commitLog then to
> the memtable.
>
> If any of memtable's 3 tunable thresholds are hit a flush occurs writing the
> data sorted by key to the SSTABLE still enabling sequential disk access. The
> fastest disk access is sequential while random is the slowest - this is just
> a statement to make sure people are on the same page.
>
> Now the data in the SSTABLE is immutable, and when the SSTABLE threshold is
> hit (4 by default) the multiple SSTABLES are turned into a single TABLE,
> this is compaction. Now I can't verify this because I never see more then 1
> SSTABLE at a time, which could just be me misinterpreting the info. This
> Operation singly should produce the most load.
>
> Using JCONSOLE connected to a live Cass box JMX port and forced operations
> from the MBEANS tab. I've quantified that normal operations for my
> environment  is 3% CPU utilization.
> Each Node roughly does 200 operations per second on a standard key/value
> pair sorted by UTF8 with a Random Partitioner.
>
> Doing a Flush for this 10MB Memtable consumes 33% of my CPU from Java's
> point of view and that 33% is sustained for 15 mins - I assume some sorting
> needs to be done according to this link:
> http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent.FLUSH-WRITER-POOL
>
> Doing a compaction consumes 30% of my CPU and lasts for 10 mins from the
> subsequent flush.
>
> Now, to give some context, my MemtableDataSize was around 11,000,000. I am
> going to assume that this is in units of bytes, so roughly 10MB of data. Is
> assuming that MemtableDataSize is in units of bytes correct? If so, why
> does it take 15 mins to flush 10MB of data to a RAID-10 array of eight 15K
> RPM disks with 256 MB of memory and a battery-backed cache (BBC) on a
> dedicated hardware controller?
>
> In addition to this, I see the following in system.log:
>
> WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-26 08:27:11,541
> MessageDeserializationTask.java (line 47) dropping message (9,713ms past
> timeout)
>
> 39,625 of these events are showing. How do I get rid of this? If it's a
> warning and it's dropping messages, how can I fix the warning?
>
>
>
> So, in summary, on Cassandra 0.6.4
> with:
> TotalDiskSpaceUsed - 40GB
> MemtableDataSize - 10MB
> MemtableColumnsCount - 130K
>
> How do I get rid of the MessageDeserializationTask warnings?
> How do I reduce the compaction time?
> Why is flushing taking so long?
> What do I look at to tune memtable thresholds? This
> http://wiki.apache.org/cassandra/MemtableThresholds does not help, and I
> believe I'm not in a position where I need to tune the defaults (the
> relevant storage-conf.xml settings are sketched after this message).
>
> http://www.riptano.com/blog/cassandra-annotated-changelog-063 allows me to
> lower the priority of compaction, but how do I tune for the best
> compaction times versus system load?
>
>
>
> Below is some sample output of system.log
>
> Notice that SSTable 1155 is 320 MB. Is the sorting taking all the time?
> Because 320MB should be fast to write even randomly: 320 MB written
> sequentially takes 2 seconds, while randomly I would expect 9 seconds (320
> MB / 34 MB/sec random writes == 9 seconds).
>
>
>  INFO [COMPACTION-POOL:1] 2010-09-14 20:40:39,120 CompactionManager.java
> (line 320) Compacted to
> /data/cassandra/data/TimeFrameClicks/Standard2-1155-Data.db.
> 339681221/339679627 bytes for 1253279 keys.  Time: 18645ms.
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 21:14:49,518
> SSTableDeletingReference.java (line 104) Deleted
> /data/cassandra/data/TimeFrameClicks/Standard2-1154-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 21:14:49,547
> SSTableDeletingReference.java (line 104) Deleted
> /data/cassandra/data/TimeFrameClicks/Standard2-1152-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 21:14:49,574
> SSTableDeletingReference.java (line 104) Deleted
> /data/cassandra/data/TimeFrameClicks/Standard2-1153-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 21:14:49,603
> SSTableDeletingReference.java (line 104) Deleted
> /data/cassandra/data/TimeFrameClicks/Standard2-1151-Data.db
>  INFO [ROW-MUTATION-STAGE:6] 2010-09-14 22:24:33,320 ColumnFamilyStore.java
> (line 357) Standard2 has reached its threshold; switching in a fresh
> Memtable at
> CommitLogContext(file='/data/cassandra/commitlog/CommitLog-1284496449043.log',
> position=68512986)
>  INFO [ROW-MUTATION-STAGE:6] 2010-09-14 22:24:33,320 ColumnFamilyStore.java
> (line 609) Enqueuing flush of memtable-standa...@798596883(11324628 bytes,
> 314573 operations)
>  INFO [FLUSH-WRITER-POOL:1] 2010-09-14 22:24:33,321 Memtable.java (line 148)
> Writing memtable-standa...@798596883(11324628 bytes, 314573 operations)
>  INFO [FLUSH-WRITER-POOL:1] 2010-09-14 22:24:38,974 Memtable.java (line 162)
> Completed fl
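
As a companion to the JConsole workflow described above, here is a minimal
sketch of polling the same memtable attributes over JMX from Java. The JMX
port (8080 was the 0.6-era default) and the exact MBean ObjectName pattern
are assumptions - verify both in JConsole for your version; the keyspace and
column family names are taken from the log excerpt above:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MemtableStats {
    public static void main(String[] args) throws Exception {
        // Assumed host/port; adjust to your JMX settings.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = connector.getMBeanServerConnection();
        // Hypothetical ObjectName pattern for a 0.6-era ColumnFamilyStore
        // MBean; double-check the exact domain/keys in JConsole.
        ObjectName cfstore = new ObjectName(
                "org.apache.cassandra.db:type=ColumnFamilyStores,"
                + "keyspace=TimeFrameClicks,columnfamily=Standard2");
        System.out.println("MemtableDataSize: "
                + mbs.getAttribute(cfstore, "MemtableDataSize"));
        System.out.println("MemtableColumnsCount: "
                + mbs.getAttribute(cfstore, "MemtableColumnsCount"));
        connector.close();
    }
}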
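On the tuning question quoted above: in the 0.6 line, the three memtable
flush thresholds (and the RPC timeout behind the "dropping message" warning)
live in storage-conf.xml. A sketch of the relevant elements, with
illustrative values rather than recommendations - verify the element names
against the sample storage-conf.xml that ships with your release:

<!-- Flush the memtable when it reaches this size in MB... -->
<MemtableThroughputInMB>64</MemtableThroughputInMB>
<!-- ...or after this many operations (in millions)... -->
<MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
<!-- ...or after this many minutes, whichever comes first. -->
<MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
<!-- Messages older than this are dropped rather than processed;
     raising it masks overload rather than fixing it. -->
<RpcTimeoutInMillis>10000</RpcTimeoutInMillis>

Larger thresholds mean fewer, bigger flushes and less frequent compaction,
at the cost of more memory held per memtable and longer commit log replay
after a restart.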

Re: Memtable adjusting impact expectations?

2010-09-14 Thread Dathan Pattishall
Yeah, this was a bit of a read, so...

I think what I really need is this:

http://www.slideshare.net/driftx/cassandra-summit-2010-performance-tuning




On Tue, Sep 14, 2010 at 6:55 PM, Jonathan Ellis  wrote:

> If your question is, "should I optimize for faster compaction or less
> impact on my cluster," the answer is almost always "less impact."
>
> On Tue, Sep 14, 2010 at 8:33 PM, Dathan Pattishall wrote:
> > [original message quoted in full; trimmed here - see the first message
> > in this thread above]

Re: Memtable adjusting impact expectations?

2010-09-14 Thread Brandon Williams
On Tue, Sep 14, 2010 at 10:52 PM, Dathan Pattishall wrote:

> Yeah, this was a bit of a read, so...
>
> I think what I really need is this:
>
> http://www.slideshare.net/driftx/cassandra-summit-2010-performance-tuning
>
>
http://riptano.blip.tv/file/4011985/ is even better. :)


-Brandon


Re: Cassandra performance

2010-09-14 Thread Kamil Gorlo
Hello,

On Wed, Sep 15, 2010 at 3:45 AM, Chen Xinli  wrote:

[cut]

> Disabling the row cache is OK, but the key cache should be enabled. It
> uses little memory, but read performance will improve a lot.

Hmm, I've tested with key cache enabled (100%) and I am pretty sure
that this really doesn't help significantly...

Cheers,
Kamil
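
For reference, in 0.6 the key cache is configured per column family in
storage-conf.xml via the KeysCached attribute, which accepts an absolute
count or a percentage. A sketch with an illustrative column family
definition (verify the attribute name against your sample config):

<ColumnFamily Name="Standard2"
              CompareWith="UTF8Type"
              KeysCached="100%"/>

The key cache only saves the index lookup per SSTable read; when every read
still needs a seek for the data itself, as in a disk-bound random-read test,
the overall gain can indeed be modest.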


Re: Cassandra performance

2010-09-14 Thread Kamil Gorlo
Hello,

On Wed, Sep 15, 2010 at 3:53 AM, Jonathan Ellis  wrote:
> The key is that while Cassandra may read fewer rows per second than
> MySQL when you are i/o bound (as you are here) because of SSTable
> merging (see http://wiki.apache.org/cassandra/MemtableSSTable), you
> should be using your Cassandra rows as materialized views so that each
> query is a single row lookup rather than many.
>

Thanks for your reply, Jonathan! Of course there is an advantage in that,
for certain purposes, you can store a lot of columns in a single row and
performance stays almost the same as with a single column.

But to be honest, I'm pretty disappointed that Cassandra doesn't really
scale linearly (or "semi-linearly" :)) when adding new machines. I
expected that an 8-machine cluster would easily beat a single MySQL when
there is much more data than RAM.

Cheers,
Kamil
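
To make the materialized-view advice above concrete, here is a minimal
sketch against the 0.6 Thrift API: store everything a given query needs as
columns of a single row, then fetch it with one get_slice call instead of
many point lookups. Keyspace, column family, row key, and host are
illustrative:

import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class MaterializedViewRead {
    public static void main(String[] args) throws Exception {
        TTransport transport = new TSocket("localhost", 9160);
        Cassandra.Client client =
                new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        // An empty start/finish range slices the whole row, up to 100 columns.
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(
                new SliceRange(new byte[0], new byte[0], false, 100));
        // One row holds every column the query needs, so a single slice
        // replaces what would otherwise be many separate lookups.
        List<ColumnOrSuperColumn> row = client.get_slice(
                "Keyspace1",                   // illustrative keyspace
                "user42",                      // illustrative row key
                new ColumnParent("Standard2"), // illustrative column family
                predicate,
                ConsistencyLevel.QUORUM);
        System.out.println("Fetched " + row.size() + " columns in one call");
        transport.close();
    }
}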


Re: Cassandra performance

2010-09-14 Thread Oleg Anastasyev
Kamil Gorlo writes:

> 
> So I've got more reads from single MySQL with 400GB of data than from
> 8 machines storing about 266GB. This doesn't look good. What am I
> doing wrong? :)

The worst case for Cassandra is random reads. You should ask yourself a
question: do you really have this kind of workload in production? If you
really do, that means Cassandra is not the right tool for the job. Some
product based on Berkeley DB should work better, e.g. Voldemort. A plain old
filesystem is also good for 100% random reads (if you don't need backups, of
course).



max columns number

2010-09-14 Thread Mark Zitnik
Hi,

What is the maximum number of columns per key that Cassandra supports?

Thanks
-Mark Zitnik