Re: how large cassandra could scale when it need to do manual operation?

2011-07-09 Thread Yan Chunlu
Thank you very much for the reply; it gives me more confidence in
Cassandra.
I will try the automation tools; the examples you've listed seem quite
promising!


about the decommission problem, here is the link:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/how-to-decommission-two-slow-nodes-td5078455.html
 I am also trying to deploy Cassandra across two datacenters (with 20ms
latency), so I am worried that the network latency will make things even
worse.

Maybe I misunderstood the replication factor: doesn't RF=3 mean I could lose
two nodes and still have one replica available (with 100% of the keys), as
long as Nodes >= 3? Besides, I am not sure what Twitter's RF setting is, but
it is possible to lose 3 nodes at the same time (Facebook once lost photos
because their RAID broke, though that rarely happens). I have a strong urge
to set RF to a very high value...

Thanks!


On Sat, Jul 9, 2011 at 5:22 AM, aaron morton wrote:

> AFAIK Facebook Cassandra and Apache Cassandra diverged paths a long time
> ago. Twitter is a vocal supporter with a large Apache Cassandra install,
> e.g. "Twitter currently runs a couple hundred Cassandra nodes across a half
> dozen clusters. "
> http://www.datastax.com/2011/06/chris-goffinet-of-twitter-to-speak-at-cassandra-sf-2011
>
>
>
> If you are working with a 3 node cluster, removing/rebuilding/whatever on
> one node will affect 33% of your capacity. When you scale up, the
> contribution from each individual node goes down, and the impact of one
> node going down is less. Problems that happen with a few nodes will go away
> at scale, to be replaced by a whole set of new ones.
>
>
> 1) the load balancing needs to be performed manually on every node,
> according to:
>
> Yes
>
> 2) when adding new nodes, you need to perform node repair and cleanup on
> every node
>
> You only need to run cleanup, see
> http://wiki.apache.org/cassandra/Operations#Bootstrap
>
> 3) when decommissioning a node, there is a chance that it slows down the
> entire cluster (not sure why, but I saw people asking around about it), and
> the only fix is to shut down the entire cluster, rsync the data, and start
> all nodes without the decommissioned one.
>
> I cannot remember any specific cases where decommission requires a full
> cluster stop, do you have a link? With regard to slowing down, the
> decommission process will stream data from the node you are removing onto
> the other nodes; this can slow down the target nodes (I think it's more
> intelligent now about what is moved). This will be exaggerated in a 3 node
> cluster, as you are removing 33% of the processing and adding some
> (temporary) extra load to the remaining nodes.
>
> All in all, I think there is a lot of human work required to maintain the
> cluster, which makes it impossible to scale to thousands of nodes,
>
> Automation, Automation, Automation is the only way to go.
>
> Chef, Puppet, or CFEngine for general config and deployment; Cloudkick,
> munin, ganglia, etc. for monitoring; and
> OpsCenter (http://www.datastax.com/products/opscenter) for Cassandra
> specific management.
>
> Maybe I am totally wrong about all of this. Currently I am serving 1 million
> page views every day with Cassandra and it makes me feel unsafe; I am afraid
> that one day a node crash will corrupt the data and the whole cluster will
> go wrong
>
> With RF 3 and a 3 node cluster you have room to lose one node and the
> cluster will be up for 100% of the keys. While better than having to worry
> about *the* database server, it's still entry-level fault tolerance. With RF
> 3 in a 6 node cluster you can lose up to 2 nodes and still be up for 100% of
> the keys.
>
> Is there something you are specifically concerned about with your current
> installation?
>
> Cheers
>
>   -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 8 Jul 2011, at 08:50, Yan Chunlu wrote:
>
> hi, all:
> I am curious about how large Cassandra can scale.
>
> From the information I can find, the largest deployment is at Facebook,
> which is about 150 nodes. Meanwhile, they are using 2000+ nodes with
> Hadoop, and Yahoo is even running 4000 Hadoop nodes.
>
> I do not understand why that is the situation; I have only a little
> knowledge of Cassandra and none of Hadoop.
>
>
>
> Currently I am using Cassandra with 3 nodes and having problems bringing
> one back after it fell out of sync. The problems I encountered make me worry
> about how Cassandra can scale out:
>
> 1) the load balancing needs to be performed manually on every node,
> according to:
>
> def tokens(nodes):
>     for x in xrange(nodes):
>         print 2 ** 127 / nodes * x
>
>
>
> 2) when adding new nodes, you need to perform node repair and cleanup on
> every node
>
>
>
> 3) when decommissioning a node, there is a chance that it slows down the
> entire cluster (not sure why, but I saw people asking around about it), and
> the only fix is to shut down the entire cluster, rsync the data, and start
> all nodes without the decommissioned one.
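
(For reference, the tokens function quoted above, run for a 3 node cluster,
prints the evenly spaced RandomPartitioner tokens 0, 2**127/3 and 2*2**127/3:

    0
    56713727820156410577229101238628035242
    113427455640312821154458202477256070484
)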

Re: how large cassandra could scale when it need to do manual operation?

2011-07-09 Thread Chris Goffinet
As mentioned by Aaron, yes we run hundreds of Cassandra nodes across
multiple clusters. We run with RF of 2 and 3 (most common).

We use commodity hardware and see failures all the time at this scale. We've
never had 3 nodes that were in the same replica set fail all at once. We
mitigate risk by being rack diverse, using different vendors for our hard
drives, designing workflows to make sure machines get serviced in certain
time windows, and running an extensive automated burn-in process (disk,
memory, drives) so we don't roll out nodes/clusters that could fail right away.


is there a need to backup commit log?

2011-07-09 Thread Boris Yen
Hi,

Let's say I want to migrate data from one cluster to another. In addition to
taking snapshots, do I also need to back up the commit log?

As far as I know, some of the data in the commit log might not have been
flushed to sstables when the snapshot was taken; if I only back up the
sstables, will some data be lost when I restore them in the new cluster?

Regards
Boris


Re: Pre-CassandraSF Happy Hour on Sunday

2011-07-09 Thread Eldad Yamin
Can you please use Watchitoo.com (it's free) to broadcast the event?

On Fri, Jul 8, 2011 at 8:54 PM, Richard Low  wrote:

> Hi all,
>
> If you're in San Francisco for CassandraSF on Monday 11th, then come
> and join fellow Cassandra users and committers on Sunday evening.
> Starting at 6:30pm at ThirstyBear, the famous brewing company.  We'll
> have drinks, food and more.
>
> RSVP at Eventbrite: http://pre-cassandrasf-happyhour.eventbrite.com/
>
> Hope you can join us!
>
> --
> Richard Low
> Acunu | http://www.acunu.com | @acunu
>


unsubscribe

2011-07-09 Thread vicent roca daniel


Re: is there a need to backup commit log?

2011-07-09 Thread Jonathan Ellis
Flush is done as part of the snapshot, so if you're not doing writes
post-snapshot, there's no need to ship commitlogs.  If you ARE still
doing writes, enable incremental backups to get sstables flushed
during the transfer.
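
A rough outline of that flow (a sketch only; the directory layout is from a
default 0.8-era install, and the snapshot tag, keyspace name and target host
are made up):

    # 1. snapshot on each node (snapshot flushes memtables first)
    nodetool -h localhost snapshot migrate1

    # 2. if writes continue, in cassandra.yaml turn on:
    incremental_backups: true
    # newly flushed sstables are then hard-linked into a backups/ directory

    # 3. ship the snapshot (plus any incremental backups) per keyspace
    rsync -a /var/lib/cassandra/data/Keyspace1/snapshots/migrate1/ \
          newhost:/var/lib/cassandra/data/Keyspace1/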




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Corrupted data

2011-07-09 Thread Peter Schuller
>> - Have you been running repair consistently ?
>
> Nope, only when something breaks

This is unrelated to the problem you were asking about, but if you ever
run deletes, make sure you are aware of:

http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
http://wiki.apache.org/cassandra/DistributedDeletes


-- 
/ Peter Schuller


Cassandra Secondary index/Twissandra

2011-07-09 Thread Eldad Yamin
Hi,
I have a few questions:

*Secondary index*

   1. Is there a limit on the number of columns in a single column family
   that serve as secondary indexes?
   2. Does performance decrease (significantly) if the uniqueness of the
   column’s values is high?


*Twissandra*

   1. Why, in the source (or any tutorial I've read), do the CFs for
   "Userline"/"Timeline" have a comparator of LONG_TYPE and not TimeUUID?

   
https://github.com/twissandra/twissandra/blob/master/tweets/management/commands/sync_cassandra.py
   2. Does performance decrease (significantly) if the uniqueness of the
   column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has
   lots of columns?


Thanks!
Eldad


Storing single rows on multiple nodes

2011-07-09 Thread Günter Ladwig
Hi all,

we are currently looking at using Cassandra to store highly skewed RDF data. 
With the indexes we use it may happen that a single row contains up to 20% of 
the whole dataset, meaning that it can grow larger than available disk space on 
single nodes. In [1], it says that this limitation is not likely to change in 
the future, but I was wondering if anybody has looked at this problem? 

One thing that comes to mind is a simple approach to DHT load-balancing [2], 
where keys are assigned to one node of several random alternatives (which means 
that for reading, all these nodes have to be queried). This is a bit similar to 
replication, except, of course, that only one copy of the data is stored. As 
this would require changes to the Cassandra code base, we could "simulate" this 
by randomly choosing one of several predefined suffixes and appending it to a 
key before storing it. By modifying a key this way, we could be somewhat sure 
that it will be stored at a different node. The first solution would certainly 
be preferable.

Any thoughts or experiences? Failing that, maybe someone can give me a pointer
into the Cassandra code base, where something like [2] should be
implemented.

Cheers,
Günter

[1] http://wiki.apache.org/cassandra/CassandraLimitations
[2] Byers et al.: Simple Load Balancing for Distributed Hash Tables,
http://www.springerlink.com/content/r9r4qcqxc2bmfqmr/

--  

Dipl.-Inform. Günter Ladwig

Karlsruhe Institute of Technology (KIT)
Institute AIFB

Englerstraße 11 (Building 11.40, Room 250)
76131 Karlsruhe, Germany
Phone: +49 721 608-47946
Email: guenter.lad...@kit.edu
Web: www.aifb.kit.edu

KIT – University of the State of Baden-Württemberg and National Large-scale 
Research Center of the Helmholtz Association





Re: Storing single rows on multiple nodes

2011-07-09 Thread Dan Kuebrich
Perhaps I misunderstand your proposal, but it seems that even with your
manual key placement schemes, the row would still be huge, no matter what
node it gets placed on.  A better solution might be figuring out how to make
each row into a few smaller ones to get better balancing of load and also
faster reads.

- Can you segment the column(s) of the row into different, predictably-named
rows?
- Or segment into different rows and use a secondary index to find the rows
that are part of a particular RDF?
- And/or compress the RDF data (maybe you're already doing that) to reduce
the impact of large rows?
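
As a sketch of the first suggestion above (illustrative only; the bucket
count, key scheme and helper names are all made up, Python 2 in keeping with
the thread):

    import hashlib

    NUM_BUCKETS = 16  # predictable fan-out per logical row

    def bucket_key(row_key, column_name):
        # spread one huge logical row over NUM_BUCKETS real rows,
        # deterministically by column name
        b = int(hashlib.md5(column_name).hexdigest(), 16) % NUM_BUCKETS
        return '%s:%d' % (row_key, b)

    def all_bucket_keys(row_key):
        # readers fetch (multiget) every bucket and merge the columns
        return ['%s:%d' % (row_key, b) for b in xrange(NUM_BUCKETS)]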



Re: Storing single rows on multiple nodes

2011-07-09 Thread Günter Ladwig
Hi,

On 09.07.2011, at 23:37, Dan Kuebrich wrote:

> Perhaps I misunderstand your proposal, but it seems that even with your 
> manual key placement schemes, the row would still be huge, no matter what 
> node it gets placed on.  A better solution might be figuring out how to make 
> each row into a few smaller ones to get better balancing of load and also 
> faster reads.

I probably could have been more clear. The idea is to randomly choose one node 
among several for a single key each time some data is added to a row. Say, a 
particular key is normally assigned to node n. Then, for each write to that 
key, we randomly choose one of the nodes n, n+1, n+2, ..., n+k to write the 
data (or we could choose the node with the least load). Of course, if one wants 
to read all data for that key, all these nodes have to be queried, because each 
node will store a chunk of the data for the key.

> - Can you segment the column(s) of the row into different, predictably-named 
> rows?

Yes and no. We actually do not know in advance which of the rows will be the 
ones that grow that large. However, and this is the second solution I 
described, it would be possible to randomly choose some suffix that is added 
when some data is added to a key. For example, we might have a key "abc" and a 
predefined list of suffixes (1, 2, 3). When adding data, instead of writing to 
key "abc", we randomly choose one of the suffixes and then, for example, write 
to "abc1". Of course, the number of suffixes determines how high the 
probability is that the modified keys will actually be stored on different 
nodes.
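
A client-side sketch of that suffix trick (illustrative only; the suffix list
and helper names are made up, and multiget is the pycassa-style batch read):

    import random

    SUFFIXES = ('1', '2', '3')

    def write_key(key):
        # each write lands on one randomly chosen variant of the key;
        # the variants will often, but not certainly, hash to different nodes
        return key + random.choice(SUFFIXES)

    def read_keys(key):
        # reads must fetch every variant and merge the results
        return [key + s for s in SUFFIXES]

    # e.g. cf.insert(write_key('abc'), columns)
    #      cf.multiget(read_keys('abc'))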

> - Or segment into different rows and use a secondary index to find the rows 
> that are part of a particular RDF?

This is actually something I hadn't looked at yet, thanks for the pointer!

> - And/or compress the RDF data (maybe you're already doing that) to reduce 
> the impact of large rows?

While compression would certainly help, it does not really change the 
underlying problem, just delays its effect. Actually, I hope that Cassandra 
itself will at some point take care of compression ;) Another problem is that 
you can't actually just increase cluster size to scale to larger datasets, 
because the constraint is the disk space on single nodes.

Cheers,
Günter


--  

Dipl.-Inform. Günter Ladwig

Karlsruhe Institute of Technology (KIT)
Institute AIFB

Englerstraße 11 (Building 11.40, Room 250)
76131 Karlsruhe, Germany
Phone: +49 721 608-47946
Email: guenter.lad...@kit.edu
Web: www.aifb.kit.edu

KIT – University of the State of Baden-Württemberg and National Large-scale 
Research Center of the Helmholtz Association





Re: Corrupted data

2011-07-09 Thread Héctor Izquierdo Seliva
Hi Peter.

 I have a problem with repair: it always brings the node doing the repair
down. I've tried setting index_interval to 5000, and it still dies with
OutOfMemory errors, or even worse, it generates thousands of tiny sstables
before dying.

I've tried around 20 repairs this week. None of them finished. This is on a
16GB machine using a 12GB heap so it doesn't crash (too early).






Re: node stuck "leaving"

2011-07-09 Thread aaron morton
Check the logs on all the machines for ERROR messages. An error on any of the 
nodes could have caused the streaming to hang. nodetool netstats will let you 
know if there is a failed stream. 

AFAIK if you restart the cassandra service on node 1 it will forget it was 
leaving and rejoin the ring in a normal state.
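
(Something like the following, with x.x.x.1 being the node shown as Leaving in
the quoted ring output below; the restart-to-clear-leaving behaviour is as I
recall it, so test somewhere safe first:

    nodetool -h x.x.x.1 netstats   # look for failed or stalled streams
)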

cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 8 Jul 2011, at 16:27, Casey Deccio wrote:

> I've got a node that is stuck "Leaving" the ring.  Running "nodetool 
> decommission" never terminates.  It's been in this state for about a week, 
> and the load has not decreased:
> 
> $ nodetool -h localhost ring
> Address  DC           Rack   Status  State    Load       Owns    Token
>                                                                  Token(bytes[de4075d0a474c4a773efa2891c020529])
> x.x.x.1  datacenter1  rack1  Up      Leaving  150.63 GB  33.33%  Token(bytes[10956f12b46304bf70412ad0eac14344])
> x.x.x.2  datacenter1  rack1  Up      Normal   79.21 GB   33.33%  Token(bytes[50af14df71eafac7bac60fbc836c6722])
> x.x.x.3  datacenter1  rack1  Up      Normal   60.74 GB   33.33%  Token(bytes[de4075d0a474c4a773efa2891c020529])
> 
> Any ideas?
> 
> Regards,
> Casey



Re: how large cassandra could scale when it need to do manual operation?

2011-07-09 Thread aaron morton
> about the decommission problem, here is the link:  
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/how-to-decommission-two-slow-nodes-td5078455.html
The key part of that post is "and since the second node was under heavy load, 
and not enough ram, it was busy GCing and worked horribly slow".

> Maybe I misunderstood the replication factor: doesn't RF=3 mean I could lose
> two nodes and still have one replica available (with 100% of the keys), as
> long as Nodes >= 3?
When you start losing replicas, the CL you use dictates whether the cluster is 
still up for 100% of the keys. See http://thelastpickle.com/2011/06/13/Down-For-Me/ 

> I have a strong urge to set RF to a very high value...
As Chris said, 3 is about normal; it means the QUORUM CL is only 2 nodes. 
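
As a rough sketch of the arithmetic (my own illustration; this counts failures
among the RF nodes holding a given key, i.e. one replica set):

    def replicas_required(rf, cl):
        # live replicas a request needs at a given consistency level
        return {'ONE': 1, 'QUORUM': rf // 2 + 1, 'ALL': rf}[cl]

    def tolerable_failures(rf, cl):
        # replica deaths survivable with 100% of keys still served
        return rf - replicas_required(rf, cl)

    tolerable_failures(3, 'ONE')     # -> 2
    tolerable_failures(3, 'QUORUM')  # -> 1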

> I am also trying to deploy Cassandra across two datacenters (with 20ms
> latency).

Look up LOCAL_QUORUM in the wiki.
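
For example, with a pycassa-style client (a sketch; the keyspace, server and
CF names are placeholders, and LOCAL_QUORUM only makes sense with
NetworkTopologyStrategy):

    import pycassa
    from pycassa import ConsistencyLevel

    pool = pycassa.ConnectionPool('MyKeyspace', ['dc1-node1:9160'])
    cf = pycassa.ColumnFamily(pool, 'MyCF',
                              read_consistency_level=ConsistencyLevel.LOCAL_QUORUM,
                              write_consistency_level=ConsistencyLevel.LOCAL_QUORUM)

so requests only wait on a quorum of replicas in the local datacenter, keeping
the 20ms cross-DC hop off the request path.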

Hope that helps. 

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 9 Jul 2011, at 02:01, Chris Goffinet wrote:

> As mentioned by Aaron, yes we run hundreds of Cassandra nodes across multiple 
> clusters. We run with RF of 2 and 3 (most common). 
> 
> We use commodity hardware and see failures all the time at this scale. We've 
> never had 3 nodes that were in the same replica set fail all at once. We 
> mitigate risk by being rack diverse, using different vendors for our hard 
> drives, designing workflows to make sure machines get serviced in certain time 
> windows, and running an extensive automated burn-in process (disk, memory, 
> drives) so we don't roll out nodes/clusters that could fail right away.

Re: Cassandra Secondary index/Twissandra

2011-07-09 Thread aaron morton
> Is there a limit on the number of columns in a single column family that 
> serve as secondary indexes? 
AFAIK there is no coded limit; however, every index is implemented as another 
(hidden) column family that inherits the settings of the parent CF. So under 
0.7 you may run out of memory, and under 0.8 you may flush a lot. Also, when an 
indexed column is updated there are potentially 3 operations that have to 
happen: read the old value, delete the old value, write the new value. More 
indexes == more index updating, just like any other database. 
> Does performance decrease (significantly) if the uniqueness of the column’s 
> values is high?
Low cardinality is recommended
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Secondary-indices-Why-low-cardinality-td6160509.html
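
For reference, declaring a KEYS index in the 0.7/0.8-era cassandra-cli looks
roughly like this (CF and column names are made up):

    [default@MyKeyspace] update column family users
        with column_metadata = [{column_name: state,
                                 validation_class: UTF8Type,
                                 index_type: KEYS}];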

> The CFs for "Userline"/"Timeline" have a comparator of LONG_TYPE and not
> TimeUUID?
Probably just to make the demo easier. It's used to order tweets in the user 
and public timelines by the current time 
https://github.com/twissandra/twissandra/blob/master/cass.py#L204

> Does performance decrease (significantly) if the uniqueness of the column’s 
> name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of 
> columns?
Depends on what sort of operations you are doing. Some read operations have to 
pay a constant cost to decode the row-level column index; this can be tuned, 
though. AFAIK the comparator type has very little to do with the performance. 
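
(The tunable I have in mind is the row-level index granularity in
cassandra.yaml; 64 is the default:

    column_index_size_in_kb: 64

Raising it means fewer index entries per wide row, at the cost of reading more
data to find a column.)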

Hope that helps. 

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com




Re: Corrupted data

2011-07-09 Thread aaron morton
> Nope, only when something breaks
Unless you've been working at QUORUM, life is about to get trickier. Repair is 
an essential part of running a cassandra cluster; without it you risk data loss 
and dead data coming back to life. 

If you have been writing at QUORUM, and so have a reasonable expectation of 
data replication, the normal approach is to happily let scrub skip the rows; 
after scrub has completed, a repair will see the data repaired using one of the 
other replicas. That's probably already happened, as the scrub process skipped 
the rows when writing them out to the new files. 

Try to run repair. Try running it on a single CF to start with.


Good luck

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com




Re: Corrupted data

2011-07-09 Thread Jonathan Ellis
Sounds like your non-repair workload is using too much of the heap.

Alternatively, you could have a very large supercolumn that causes the
OOM when it is read.




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: is there a need to backup commit log?

2011-07-09 Thread Boris Yen
Thanks. I will try it.

Boris



Re: Corrupted data

2011-07-09 Thread Héctor Izquierdo Seliva
All the important stuff is using QUORUM. Normal operation uses around 3-4 GB
of heap out of 6. I've also tried running repair on a per-CF basis, and still
no luck. I've found it's faster to bootstrap a node again than to repair it.

Once I have the cluster in a sane state I'll try running a repair as part of
normal operation and see if it manages to finish.

By the way, we are not using super columns.

Thanks for the tips





Re: how large cassandra could scale when it need to do manual operation?

2011-07-09 Thread Yan Chunlu
I missed the consistency level part; thanks very much for the explanation.
That is clear enough.
