Advice on settings

2010-10-07 Thread Dave Gardner
Hi all

We're rolling out a Cassandra cluster on EC2 and I've got a couple of
questions about settings. I'm interested to hear what other people
have experienced with different values and generally seek advice.

*gcgraceseconds*

Currently we configure one setting for all CFs. We experimented with
this a bit during testing, including changing from the default (10
days) to 3 hours. Our use case involves lots of rewriting the columns
for any given key. We probably rewrite around 5 million per day.

We are thinking of setting this to around 3 days for production so
that we don't have old copies of data hanging round. Is there anything
obviously wrong with this? Out of curiosity, would there be any
performance issues if we had this set to 30 days? My understanding is
that it would only affect the amount of disk space used.

However Ben Black suggests here that the cleanup will actually only
impact data deleted through the API:

http://comments.gmane.org/gmane.comp.db.cassandra.user/4437

In this case, I guess that we need not worry too much about the
setting since we are actually updating, never deleting. Is this the
case?


*Replication factor*

Our use case is many more writes than reads, but when we do have reads
they're random (we're not currently using hadoop to read entire CFs).
I'm wondering what sort of level of RF to have for a cluster. We
currently have 12 nodes and RF=4.

To improve read performance I'm thinking of upping the number of nodes
and keeping RF at 4. My understanding is that this means we're sharing
the data around more. However it also means a client read to a random
node has less chance of actually connecting to one of the nodes with
the data on. I'm assuming this is fine. What sort of RFs do others
use? With a huge cluster like the recently mentioned 400 node US govt
cluster, what sort of RF is sane?

On a similar note (read perf), I'm guessing that reading at weak
consistency level will bring gains. Gleaned from this slide amongst
other places:

http://www.slideshare.net/mobile/benjaminblack/introduction-to-cassandra-replication-and-consistency#13

Is this true, or will read repair still hammer disks in all the
machines with the data on? Again I guess it's better to have low RF so
there are fewer copies of the data to inspect when doing read repair.
Will this result in better read performance?

Thanks

dave


-- 
*Dave Gardner*
Technical Architect


*Imagini Europe Limited*
7 Moor Street, London W1D 5NB

+44 20 7734 7033
skype: daveg79
dave.gard...@imagini.net
http://www.visualdna.com

Imagini Europe Limited, Company number 5565112 (England
and Wales), Registered address: c/o Bird & Bird,
90 Fetter Lane, London, EC4A 1EQ, United Kingdom


Creating and using indices

2010-10-07 Thread Christian Decker
I'm currently trying to get started on secondary indices in Cassandra
0.7.0svn, but without any luck so far. I have the following code that should
create an index on ColA:

KsDef ksDef = client.describe_keyspace("MyKeyspace");
> List<CfDef> cfs = ksDef.cf_defs;
> String columnFamily = "MyCF";
> for (CfDef cf : ksDef.cf_defs) {
>     if (cf.getName().equals(columnFamily)) {
>         ColumnDef cd1 = new ColumnDef("ColA".getBytes(),
>                 "org.apache.cassandra.db.marshal.UTF8Type");
>         cd1.index_type = IndexType.KEYS;
>         cf.column_metadata.add(cd1);
>         // Write changes back to DB
>         client.system_update_column_family(cf);
>     }
> }
>

which seems to work nicely, since when I turn up Cassandra's logging level
it appears to apply some migrations. But when I then try to use a pycassa
client to read an indexed_slice I only get an
InvalidRequestException(why='No indexed columns present in index clause with
operator EQ'):

cf = pycassa.ColumnFamily(client, "MyCF")
> ex = pycassa.index.create_index_expression('ColA', '50',
> pycassa.index.IndexOperator.LTE)
> clause = pycassa.index.create_index_clause([ex])
> cf.get_indexed_slices(clause)
>

 Am I missing something?

Regards,
Chris


Re: Creating and using indices

2010-10-07 Thread Matthew Dennis
If I remember correctly the only operator supported for secondary indexes
right now is EQ, not LTE (or the others).




-- 
Riptano
Software and Support for Apache Cassandra
http://www.riptano.com/
mden...@riptano.com
m: 512.587.0900 f: 866.583.2068


Re: Creating secondary indices after startup

2010-10-07 Thread Jonathan Ellis
This is not in beta2 but will be in 0.7.0
(https://issues.apache.org/jira/browse/CASSANDRA-1532)

On Thu, Oct 7, 2010 at 7:30 AM,   wrote:
> Hello,
>
> I am trying to work out the new secondary index code on my own, as there
> is no documentation. I've seen the 'Cassandra explained' presentation and
> the tests related to index queries. What I'd like to know is
>
> 1) the basics of the secondary index mechanism (a short human-readable
> description, as the source is confusing me), and
>
> 2) how I can go about implementing support for creating indices on a live
> cluster (if it can be done)
>
> Alexander Altanis
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Creating and using indices

2010-10-07 Thread Petr Odut
From what I've tested, you must include at least one expression with the EQ operator.
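For example, something like this should go through (an untested sketch that reuses
the client handle and column names from the snippet above; the value '50' is just a
placeholder):

    cf = pycassa.ColumnFamily(client, 'MyCF')
    # the clause needs at least one EQ expression on an indexed column
    eq_expr = pycassa.index.create_index_expression('ColA', '50',
                                                    pycassa.index.IndexOperator.EQ)
    clause = pycassa.index.create_index_clause([eq_expr])
    results = cf.get_indexed_slices(clause)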



Re: Creating and using indices

2010-10-07 Thread Christian Decker
So basically my indices should work? Is there a simple way to check that, so
that we can rule that out?

Is LTE working (or on the roadmap for the 0.7.0 release)? Or does the EQ
operator have to match anything, or can I just add an EQ on a non-existent
value to get LTE working too?

Regards,
Chris



Re: Creating and using indices

2010-10-07 Thread Tyler Hobbs
Actually, you're trying to add an index to an already existing column family
here, right?

That's not yet supported, but should be soon.

- Tyler



Re: Creating and using indices

2010-10-07 Thread Jonathan Ellis
On Thu, Oct 7, 2010 at 10:13 AM, Christian Decker
 wrote:
> So basically my indices should work? Is there a simple way to check that, so
> that we can exclude that?
>
> Are LTE working (or on the roadmap for the 0.7.0 release)?

No, LT[E] is not on the roadmap for primary index clauses (GT[E] is,
for 0.7.1).  So you would want to create an index with an inverted
comparator, to turn LTE into GTE.

> Or does the EQ
> operator have to match anything, or can I just add an EQ on a non-existent
> value to get LTE working too?

If you ask for EQ not-existing-value you will get no results back, of course.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Advice on settings

2010-10-07 Thread Peter Schuller
> However Ben Black suggests here that the cleanup will actually only
> impact data deleted through the API:
>
> http://comments.gmane.org/gmane.comp.db.cassandra.user/4437
>
> In this case, I guess that we need not worry too much about the
> setting since we are actually updating, never deleting. Is this the
> case?

Yes, that's correct. GCGraceSeconds affects the lifetime of
tombstones, which are needed only when deleting data. Simple
overwrites do not involve tombstones, and GCGraceSeconds is not in
play. Overwritten columns are eliminated when their sstables are
compacted.
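To make the distinction concrete, a small pycassa sketch (assumes an existing
ColumnFamily handle named cf; untested):

    # An overwrite: the new value simply supersedes the old one the next time
    # the sstables holding both are compacted. No tombstone is written, so
    # GCGraceSeconds does not come into play.
    cf.insert('some_key', {'some_column': 'new_value'})

    # A delete through the API: this writes a tombstone, and GCGraceSeconds
    # controls how long that tombstone must be kept before compaction may
    # purge it.
    cf.remove('some_key')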

> *Replication factor*
>
> Our use case is many more writes than reads, but when we do have reads
> they're random (we're not currently using hadoop to read entire CFs).
> I'm wondering what sort of level of RF to have for a cluster. We
> currently have 12 nodes and RF=4.
>
> To improve read performance I'm thinking of upping the number of nodes
> and keeping RF at 4.

In the absence of other bottlenecks, this makes sense yes.

Another thing to consider is whether to turn off (if on 0.6) or adjust
the frequency of (in 0.7) read-repair. If read repair is turned on
(0.6) or at 100% (0.7), each read will hit RF numbers of nodes (even
if you are reading at a low consistency level, with read repair, other
nodes are still asked to read the data and send back a checksum). If
you expect to be I/O bound due to low locality of access (the random
reads), this could potentially yield up to a factor of RF (in your
case 4) improvement in expected read throughput.

Whether or not turning off or decreasing read repair is acceptable is
of course up to your situation; and in particular if you read at e.g.
QUORUM you will still read from 3 (in the case of RF=4) nodes
regardless of read repair settings.
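For example, a weak-consistency read with pycassa might look like this (a
sketch; the exact import path for ConsistencyLevel can vary between pycassa
versions, and client is assumed to be an existing connection):

    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel

    # The read is answered by a single replica; whether the other replicas
    # are also asked for digests depends on the read repair settings above.
    cf = pycassa.ColumnFamily(client, 'MyCF',
                              read_consistency_level=ConsistencyLevel.ONE)
    row = cf.get('some_key')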

> My understanding is that this means we're sharing
> the data around more.

Not sure what you mean. Given a constant RF of 4, you will still have
4 copies, but they will be distributed across additional machines,
meaning each machine has less data and presumably gets less requests.

> However it also means a client read to a random
> node has less chance of actually connecting to one of the nodes with
> the data on.

Keep in mind though that hitting the right node is somewhat of a
special case, and the overhead is limited to whatever the cost of RPC
is. If you are expecting to bottleneck on disk seeks (judging again by
your random read comment), I would say you can completely ignore this.
When I say it's a special case, I mean that you're adding between 0
and 1 units of RPC overhead (on average); no matter how large your
cluster is, your RPC overhead won't exceed 1, with 1 being whatever
the cost is to forward a request+response.

> On a similar note (read perf), I'm guessing that reading at weak
> consistency level will bring gains. Gleaned from this slide amongst
> other places:
>
> http://www.slideshare.net/mobile/benjaminblack/introduction-to-cassandra-replication-and-consistency#13
>
> Is this true, or will read repair still hammer disks in all the
> machines with the data on? Again I guess it's better to have low RF so
> there are fewer copies of the data to inspect when doing read repair.
> Will this result in better read performance?

Sorry, I did the impolite thing and began responding before having
read your entire E-Mail ;)

So yes, a low RF would increase read performance, but assuming you
care about data redundancy the better way to achieve that effect is
probably to decrease or disable read repair.
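In 0.7 this is a per-column-family knob; a rough sketch of what it might look
like in a keyspace definition (attribute name as in 0.7; exact placement depends
on whether your schema lives in cassandra.yaml or is created through the API):

    column_families:
        - name: MyCF
          compare_with: UTF8Type
          # 0.1 = read repair on roughly 10% of reads; 0.0 disables it
          read_repair_chance: 0.1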

-- 
/ Peter Schuller


Re: Advice on settings

2010-10-07 Thread B. Todd Burruss
if you are updating columns quite rapidly, you will scatter the columns 
over many sstables as you update them over time.  this means that a read 
of a specific column will require looking at more sstables to find the 
data.  performing a compaction (using nodetool) will merge the sstables 
into one making your reads more performant.  of course the more columns, 
the more scattering around, the more I/O.
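for example (run it against one node at a time, and adjust host/JMX port to your
setup):

    nodetool -h <cassandra-host> -p <jmx-port> compact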


to your point about "sharing the data around".  adding more machines is 
always a good thing to spread the load - you add RAM, CPU, and 
persistent storage to the cluster.  there probably is some point where 
enough machines creates a lot of network traffic, but 10 or 20 machines 
shouldn't be an issue.  don't worry about trying to hit a node that has 
the data unless your machines are connected across slow network links.




Re: Advice on settings

2010-10-07 Thread Dave Viner
Also, as a note related to EC2, choose whether you want to be in multiple
availability zones.  The highest performance possible is to be in a single
AZ, as all those machines will have *very* high speed interconnects.  But,
individual AZs also can suffer outages.  You can distribute your instances
across, say, 2 AZs, and then use a RackAwareStrategy to force replication to
put at least 1 copy of the data into the other AZ.

Also, it's easiest to stay within a single Region (in EC2-speak).  This
allows you to use the internal IP addresses for Gossip and Thrift
connections - which means you do not pay inbound-outbound fees for the data
xfer.

HTH,
Dave Viner




Heap Settings suggestions

2010-10-07 Thread kannan chandrasekaran
From the Cassandra documentation @ Riptano I see the following recommendation
for heap size settings:

MemtableThroughputInMB * 3 * (number of ColumnFamilies) + 1G + (size of internal caches)

What if there is more than one keyspace in the system? Assuming each keyspace
has the same number of column families, can I linearly scale the above
recommendation to the number of keyspaces in the system, i.e. if "X" is the
heap size for a single keyspace and there are "Y" keyspaces, is it recommended
to allocate "X*Y" as the max heap size? Please let me know.


Thanks
Kannan

PS: Thanks a lot for the documentation and recommendations.



  

Re: Heap Settings suggestions

2010-10-07 Thread Peter Schuller
> What if there is more than one keyspace in the system ? Assuming each
> keyspace has the same number of column families, Can I linearly scale the
> above recommendation to the number of keyspaces in the system .ie, if the
> "X" is the heap size for a single keyspace and there are "Y" keyspaces, Is
> it recommended to allocate "XY" as the max  Heap size ?  Please let me know.

Yes. Each column family will have a memtable subject to the configured
memory constraints; whether or not they are in different keyspaces
does not matter.
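As a worked example with made-up numbers, purely to show the arithmetic: 3
keyspaces of 4 column families each, 64 MB memtable throughput per CF, and
roughly 512 MB of internal caches would give

    heap_mb = (64 * 3        # MemtableThroughputInMB * 3
               * 4 * 3       # CFs per keyspace * number of keyspaces
               + 1024        # the "+ 1G" term
               + 512)        # size of internal caches
    # -> 2304 + 1024 + 512 = 3840 MB, so roughly a 4 GB heap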

-- 
/ Peter Schuller


Re: Tuning cassandra to use less memory

2010-10-07 Thread Peter Schuller
> The nodes are still swapping, even though the swappiness is set to zero
> right now. After swapping comes the OOM.

In addition to what's already been said, consider just flat out
disabling swap completely, unless you have other things on the machine
that cause swap to be significantly useful (i.e., lots of truly unused
stuff that is good to keep swapped out).


-- 
/ Peter Schuller


Re: Dazed and confused with Cassandra on EC2 ...

2010-10-07 Thread Peter Schuller
>  There's some words on the 'Net that - the recent pages on
>  Riptano's site in fact - that strongly encourage scaling left
>  and right, rather than beefing up the boxes - and certainly
>  we're seeing far less bother from GC using a much smaller
>  heap - previously we'd been going up to 16GB, or even
>  higher.  This is based on my previous positive experiences
>  of getting better performance from memory hog apps (eg.
>  Java) by giving them more memory.  In any case, it seems
>  that using large amounts of memory on EC2 is just asking
>  for trouble.

Keep in mind that while GC tends to be more efficient with larger heap
sizes, that does not always translate into better overall performance
when other things have to be considered. In particular, in the case of
Cassandra, if you "waste" 10-15 gigs of RAM on the JVM heap for a
Cassandra instance which could live with e.g. 1 GB, you're actively
taking away those 10-15 gigs of RAM from the operating system to use
for the buffer cache. Particularly if you're I/O bound on reads then,
this could have very detrimental effects (assuming the data set is
sufficiently small and locality is such that 15 GB of extra buffer
cache makes a difference; usually, but not always, this is the case).

So with Cassandra, in the general case, you definitely want to keep
your heap size reasonable in relation to the actual live set (amount
of actually reachable data), rather than just cranking it up as much
as possible.

(The main issue here is also keeping it high enough to not OOM, given
that exact memory demands are hard to predict; it would be absolutely
great if the JVM was better at maintaining a reasonable heap size to
live set size ratio so that much less tweaking of heap sizes was
necessary, but this is not the case.)

-- 
/ Peter Schuller


Retrieving dead node's token from system keyspace

2010-10-07 Thread Allan Carroll
Hey all, 

I had a node go down that I'm not able to get a token for from nodetool ring.

The wiki says:

"You can obtain the dead node's token by running nodetool ring on any live 
node, unless there was some kind of outage, and the others came up but not the 
down one -- in that case, you can retrieve the token from the live nodes' 
system tables."

But, I can't for the life of me figure out how to get the system keyspace to 
give up the secret. All attempts end up in:

ERROR [pool-1-thread-2] 2010-10-07 21:20:44,865 Cassandra.java (line 1280) 
Internal error processing get_slice
java.lang.RuntimeException: No replica strategy configured for system


Can someone point me at a good way to get the token?

Thanks
-Allan

Re: Retrieving dead node's token from system keyspace

2010-10-07 Thread Allan Carroll
I was able to figure out how to use the sstable2json tool to get the values out of 
the system keyspace.

Unfortunately, the node that went down took all of its data with it and I only 
have access to the system keyspace of the remaining live node. There were only 
two nodes and the one left should have a whole DB copy.

Running removetoken on any of the values that appeared to be tokens in the 
LocationInfo cf hasn't done any good. Perhaps I'm missing which value is the 
token of the dead node? Or, is there a way to take down the last node and bring 
back up a new cluster using the sstables that I have on the remaining node?

-Allan




Cassandra and EC2 performance testing

2010-10-07 Thread Corey Hulen
I recently posted a blog article about Cassandra and EC2 performance testing
for small vs large, EBS vs ephemeral storage, compared to real HW with and
without an SSD.  Hope people find it interesting.

http://www.coreyhulen.org/?p=326

Highlights:

   - The variance in test results from run to run on EC2’s virtual hardware
   fluctuates A LOT.
   - EC2 is a finicky beast, but we like it.
   - Not all EC2 instances (for the same size ie. small) are created equal.
   - Large instances are not 4x as fast as small instances (even though they
   are 4x the price).
   - Kind of obvious, but real hardware is better…and yea SSD’s kick butt.
   - Automated scripts included.  Please have at it and reproduce the
   results with different configurations.


Thanks,

-Corey


Re: Heap Settings suggestions

2010-10-07 Thread Matthew Dennis
Keep in mind that .7 and on will have per-CF settings for most things so
there will be even more control over the tuning...


Re: Tuning cassandra to use less memory

2010-10-07 Thread Matthew Dennis
+1 on disabling swap


RE: Newbie Question about restarting Cassandra

2010-10-07 Thread David McIntosh
Are there any data loss concerns if you have the commit log sync set to
periodic and are writing with CL One or Any?

 

From: Matthew Dennis [mailto:mden...@riptano.com] 
Sent: Wednesday, October 06, 2010 8:53 PM
To: user@cassandra.apache.org
Subject: Re: Newbie Question about restarting Cassandra

 

Rob is correct.

drain is really only there for when you need the commit log to be empty (some
upgrades or a complete backup of a shutdown cluster).

There really is no point in using it to shut down C* normally, just kill it...

On Wed, Oct 6, 2010 at 4:18 PM, Rob Coli  wrote:

On 10/6/10 1:13 PM, Aaron Morton wrote:

To shutdown cleanly, say in a production system, use nodetool drain
first. This will flush the memtables and put the node into a read only
mode, AFAIK this also gives the other nodes a faster way of detecting
the node is down via the drained node gossiping it's new status. Then kill.

 

FWIW, the gossiper related code for "drain" (trunk) looks like it just stops
the gossip service, which is almost certainly the same thing that happens if
you kill Cassandra.

./src/java/org/apache/cassandra/service/StorageService.java
"
   public synchronized void drain() throws IOException,
InterruptedException, ExecutionException
...
 setMode("Starting drain process", true);
   Gossiper.instance.stop();
"

./src/java/org/apache/cassandra/gms/Gossiper.java
"
 public void stop()
   {
   scheduledGossipTask.cancel(false);
   }
"

=Rob




-- 
Riptano
Software and Support for Apache Cassandra
http://www.riptano.com/
mden...@riptano.com
m: 512.587.0900 f: 866.583.2068



Re: Retrieving dead node's token from system keyspace

2010-10-07 Thread Aaron Morton
Allan,

I'm a bit confused about what you are trying to do here. You have 2 nodes
with RF = ?, you lost one node completely and now you want to...

Just get a cluster running again, don't worry about the data.
OR
Restore the data from the dead node.
OR
Create a cluster with the data from the remaining node and a new node.

Aaron

Unfortunately, the node that went down took all of it's data with it and I only have access to the system keyspace of the remaining live node. There were only two nodes and the one left should have a whole DB copy.

Running removetoken on any of the values that appeared to be tokens in the LocationInfo cf hasn't done any good. Perhaps I'm missing which value is the token of the dead node? Or, is there a way to take down the last node and bring back up a new cluster using the sstables that I have on the remaining node?

-Allan

On Oct 7, 2010, at 3:22 PM, Allan Carroll wrote:

> Hey all, 
> 
> I had a node go down that I'm not able to get a token for from nodetool ring.
> 
> The wiki says:
> 
> "You can obtain the dead node's token by running nodetool ring on any live node, unless there was some kind of outage, and the others came up but not the down one -- in that case, you can retrieve the token from the live nodes' system tables."
> 
> But, I can't for the life of me figure out how to get the system keyspace to give up the secret. All attempts end up in:
> 
> ERROR [pool-1-thread-2] 2010-10-07 21:20:44,865 Cassandra.java (line 1280) Internal error processing get_slice
> java.lang.RuntimeException: No replica strategy configured for system
> 
> 
> Can someone point me at a good way to get the token?
> 
> Thanks
> -Allan



Re: Newbie Question about restarting Cassandra

2010-10-07 Thread Matthew Dennis
Yes.

You probably shouldn't ever be using CL.ANY (though I'm certain there are
others that disagree with me; I wish them the best of luck with that).

CL.ONE + periodic sync can potentially lose recently written data, but if
you care about that then you better care enough about your data to use
something greater than CL.ONE.  With CL.ONE + periodic: If your disk dies
you lose data.  If your OS crashes on a node you lose data.  If the
processor melts you lose data.  If your memory goes bad you lose data.  If
your UPS is interrupted you lose data.  if X (for many values of X) you lose
data.  Not that for some situations (e.g. disk failure) it doesn't matter
what the commit log sync (batch v periodic) is set to, you lose data.

If your C* process dies and/or is killed you should not lose data.  It's
written to the commit log before the client is acked, however that entry may
not have made it to disk yet in the case of commitlogsync=periodic.  So, if
you kill the C* process you're fine.  If you nicely restart the OS, you
should be fine (assuming your boxen/raid controllers/disks/etc do the sane
thing).  If you nuke your OS, then see above about losing data on CL.ONE.
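For reference, the relevant knobs in a 0.7 cassandra.yaml look roughly like this
(values are illustrative):

    # fsync the commit log every N ms; writes are acked before the sync, so an
    # OS or hardware failure can lose up to that window of acked writes.
    commitlog_sync: periodic
    commitlog_sync_period_in_ms: 10000

    # alternative: group writes and fsync before acking
    # commitlog_sync: batch
    # commitlog_sync_batch_window_in_ms: 50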




Re: Retrieving dead node's token from system keyspace

2010-10-07 Thread Matthew Dennis
Allan,

I'm confused on why removetoken doesn't do anything and would be interested
in finding out why, but to answer your question:

You can shut down your last node, nuke the system directory (make a
backup just in case), restart the node, load the schema (export it first if
need be) and be on your way.  You should end up with a node that is the
only one in the ring.  Again, make a backup of the system directory
(actually, might as well just backup the entire data and commitlog
directories) before you start nuking stuff.



-- 
Riptano
Software and Support for Apache Cassandra
http://www.riptano.com/
mden...@riptano.com
m: 512.587.0900 f: 866.583.2068


Re: Heap Settings suggestions

2010-10-07 Thread kannan chandrasekaran
Good point..

Thanks to both of you for the replies.
Kannan






Re: Dazed and confused with Cassandra on EC2 ...

2010-10-07 Thread Matthew Dennis
Also, in general, you probably want to set Xms = Xmx (regardless of the
value you eventually decide on for that).

If you set them equal, the JVM will just go ahead and allocate that amount
on startup.  If they're different, then when you grow above Xms it has to
allocate more and move a bunch of stuff around.  It may have to do this
multiple times.  Note that it does this at the worst possible time (i.e.
under heavy load, which is likely what caused you to grow past Xms in the
first place).
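For example (illustrative value; in the stock cassandra-env.sh this is normally
driven by MAX_HEAP_SIZE rather than written out by hand):

    JVM_OPTS="$JVM_OPTS -Xms4G -Xmx4G"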



Out of Memory Issues - SERIOUS

2010-10-07 Thread Dan Hendry
There seems to have been a fair amount of discussion on memory related
issues so I apologize if this exact situation has come up before. 

 

I am currently in the process of load testing a metrics platform I have
written which uses Cassandra and I have run into some very troubling issues.
The application is writing quite heavily, about 1000-2000 updates (columns)
per second using batch mutates of 20 columns each. This is divided between
creating new rows and adding columns to a fairly limited number of existing
index rows (<30). Nearly all of these updates are read within 10 seconds and
none contain any significant amount of data (generally much less than 100
bytes of data which I specify). Initially, the test hums along nicely but
after some amount of time (1-2 hours) Cassandra crashes with an out of
memory error. Unfortunately I have not had the opportunity to watch the test
as it crashes, but it has happened in 2/2 tests.

 

This is quite annoying but the absolutely TERRIFYING behaviour is that when
I restart Cassandra, it starts replaying the commit logs then crashes with
an out of memory error again. Restart a second time, crash with OOM; it
seems to get through about 3/4 of the commit logs. Just to be absolutely
explicit, I am not trying to insert or read at this point, just recover the
previous updates. Unless somebody can suggest a way to recover the commit
logs, I have effectively lost my data. The only way I have found to recover
is wipe the data directories. It does not matter right now given that it is
only a test but this behaviour is completely unacceptable for a production
system. 

 

Here is information about the system which is probably relevant. Let me know
if any additional details about my application would help sort out this
issue:

-  Cassandra 0.7 Beta2

-  DB Machine: EC2 m1 large with the commit log directory on an ebs
and the data directory on ephemeral storage.

-  OS: Ubuntu server 10.04

-  With the exception of changing JMX settings, no memory or JVM
changes were made to options in cassandra-env.sh

-  In Cassandra.yaml, I reduced binary_memtable_throughput_in_mb to
100 in my second test to try follow the heap memory calculation formula; I
have 8 column families.

-  I am using the Sun JVM, specifically "build 1.6.0_20-b02"

-  The app is written in java and I am using the latest Pelops
library, I am sending updates at consistency level ONE and reading them at
level ALL.

 

I have been fairly impressed with Cassandra overall and given that I am
using a beta version, I don't expect fully polished behaviour. What is
unacceptable, and quite frankly nearly unbelievable, is the fact that Cassandra
can't seem to recover from the error and I am losing data.

 

Dan Hendry



Re: Out of Memory Issues - SERIOUS

2010-10-07 Thread Jonathan Ellis
if you don't want to lose data, don't wipe your commit logs.  that
part seems pretty obvious to me. :)

cassandra aggressively logs its state when it is running out of memory
so you can troubleshoot.  look for the GCInspector lines in the log.

but in this case it sounds pretty simple; you will be able to finish
replaying the commitlogs if you lower your memtable thresholds or
alternatively increase the amount of memory given to the JVM.  (see
http://wiki.apache.org/cassandra/MemtableSSTable.)

the _binary_ memtable setting has no effect on commitlog replay (it
has no effect on anything but binary writes through the storageproxy
api, which you are not using), you need to adjust
memtable_throughput_in_mb and memtable_operations_in_millions.

If you haven't explicitly set these then Cassandra will guess based on
your heap size; here, it is guessing too high.  start by uncommenting
the settings in the .yaml and reduce by 50% until it works.
alternatively, apply the patch at
https://issues.apache.org/jira/browse/CASSANDRA-1595 to see what
Cassandra is guessing, and start at half of that.
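A rough sketch of what halved per-CF settings might look like in a 0.7-beta
cassandra.yaml keyspace definition (attribute names as mentioned above; the right
starting values depend on your heap and column family count):

    column_families:
        - name: MyCF
          memtable_throughput_in_mb: 32          # flush after ~32 MB of writes
          memtable_operations_in_millions: 0.15  # or after ~150k column writes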



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com