Re: Replacing a dead node by deleting it and auto_bootstrap'ing a new node (Cassandra 2.0)

2014-12-06 Thread Omri Bahumi
In that case, just don't delete the dead node (which is what I think you
should do anyway; I'm pretty sure it can't be deleted if you're going to
replace it with "-Dcassandra.replace_address=...").
I was speaking about the case where you _do_ want it replaced: you can
just delete it and bootstrap a new node, and I would expect the behaviour
to be the same.

On Sat, Dec 6, 2014 at 1:56 AM, Jaydeep Chovatia
 wrote:
> I think Cassandra gives us control over what we want to do:
> a) If we want to replace a dead node, then we should specify
> "-Dcassandra.replace_address=old_node_ipaddress".
> b) If we are adding new nodes (no replacement), then we do not specify the
> above option and tokens get assigned randomly.
>
> I can think of a scenario in which your dead node has tons of data and you
> are hopeful about its recovery, so you do not always want to replace the
> dead node. You might temporarily just add a new node to meet the capacity
> until the dead node is fully recovered.
>
> -jaydeep
>
> On Thu, Dec 4, 2014 at 11:30 PM, Omri Bahumi  wrote:
>>
>> I guess Cassandra is aware that it has some replicas not meeting the
>> replication factor. Wouldn't it be nice if a bootstrapping node picked
>> those up? It could make things much simpler from an ops point of view.
>>
>> What do you think?
>>
>> On Fri, Dec 5, 2014 at 8:31 AM, Jaydeep Chovatia
>>  wrote:
>> > As far as I know, if you have NOT explicitly specified
>> > "-Dcassandra.replace_address=old_node_ipaddress", then new (random) tokens
>> > will be assigned to the bootstrapping node instead of the dead node's tokens.
>> >
>> > -jaydeep
>> >
>> > On Thu, Dec 4, 2014 at 6:50 AM, Omri Bahumi  wrote:
>> >>
>> >> Hi,
>> >>
>> >> I was wondering, how would auto_bootstrap behave in this scenario:
>> >>
>> >> 1. I had a cluster with 3 nodes (RF=2)
>> >> 2. One node died, I deleted it with "nodetool removenode" (+ force)
>> >> 3. A new node launched with "auto_bootstrap: true"
>> >>
>> >> The question is: will the "right" vnodes go to the new node as if it
>> >> was bootstrapped with "-Dcassandra.replace_address=old_node_ipaddress"
>> >> ?
>> >>
>> >> Thanks,
>> >> Omri.
>> >
>> >
>
>



-- 


Omri Bahumi
System Architect, EverythingMe
 om...@everything.me  (+972) 52-4655544  @omribahumi


Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-06 Thread kong
Hi,

I am doing a stress test on DataStax Cassandra Community 2.1.2, not with the
provided stress-test tool but with my own stress-test client code (I wrote
some C++ stress-test code). My Cassandra cluster is deployed on Amazon EC2,
using the DataStax Community AMI (HVM instances) referenced in the DataStax
documentation, and I am not using EBS, just the default ephemeral storage.
The Cassandra server nodes are EC2 type m3.xlarge, and I use another EC2
instance of type r3.8xlarge for my stress-test client. Both the Cassandra
server nodes and the stress-test client node are in us-east. I test clusters
made up of 1 node, 2 nodes, and 4 nodes separately, and I run the INSERT test
and the SELECT test separately, but the performance does not increase
linearly when new nodes are added, and I also get some weird results. My test
results are as follows (I do 1 million operations and try to find the best
QPS with a max latency of no more than 200 ms; latencies are measured from
the client side, and QPS is calculated as total_operations / total_time).



INSERT (write) — latencies in ms:

Node count | Replication factor |   QPS |  Avg |  Min |   .95 |   .99 |  .999 |   Max
         1 |                  1 | 18687 | 2.08 | 1.48 |  2.95 |  5.74 |  52.8 | 205.4
         2 |                  1 | 20793 | 3.15 | 0.84 |  7.71 | 41.35 |  88.7 | 232.7
         2 |                  2 | 22498 | 3.37 | 0.86 |  6.04 |  36.1 | 221.5 | 649.3
         4 |                  1 | 28348 | 4.38 | 0.85 |  8.19 | 64.51 | 169.4 | 251.9
         4 |                  3 | 28631 | 5.22 | 0.87 | 18.68 | 68.35 | 167.2 |   288

SELECT (read) — latencies in ms:

Node count | Replication factor |   QPS |  Avg |  Min |   .95 |   .99 |  .999 |   Max
         1 |                  1 | 24498 | 4.01 | 1.51 |   7.6 | 12.51 |  31.5 | 129.6
         2 |                  1 | 28219 | 3.38 | 0.85 |   9.5 | 17.71 |  39.2 | 152.2
         2 |                  2 | 35383 | 4.06 | 0.87 |  9.71 | 21.25 |  70.3 | 215.9
         4 |                  1 | 34648 | 2.78 | 0.86 |  6.07 | 14.94 |  30.8 | 134.6
         4 |                  3 | 52932 | 3.45 | 0.86 | 10.81 | 21.05 |  37.4 | 189.1

The test data I use is generated randomly, and the schema I use is like (I
use the cqlsh to create the columnfamily/table):

CREATE TABLE table (
    id1  varchar,
    ts   varchar,
    id2  varchar,
    msg  varchar,
    PRIMARY KEY (id1, ts, id2)
);

So the fields are all strings, and I generate each character of the string
randomly, using srand(time(0)) and rand() in C++, so my test data should be
distributed uniformly across the Cassandra cluster. In my client stress-test
code, I use the Thrift C++ interface, and the basic operations I do are of
the form:

thrift_client.execute_cql3_query("INSERT INTO table (id1, ts, id2, msg) VALUES ('xxx', 'xxx', 'xxx', 'xxx')");
and
thrift_client.execute_cql3_query("SELECT * FROM table WHERE id1='xxx'");

Each data entry I INSERT or SELECT is around 100 characters.

On my stress-test client, I create several threads to send the read and
write requests, each thread having its own Thrift client, and at the
beginning all the Thrift clients connect to the Cassandra servers evenly.
For example, in a 4-node cluster I create 160 Thrift clients, and 40 of
them connect to each server node.

 

So,

1. Could anyone help me explain my test results? Why does the performance
(QPS) increase only slightly when new nodes are added?

2. I have learned from various materials that Cassandra has better write
performance than read performance, but in my case the read performance is
better. Why?

3. I also use OpsCenter to monitor the real-time performance of my cluster.
But when I get the average QPS above, the operations/s reported by OpsCenter
is around 1+ for the write peak and 5000+ for the read peak. Why is my result
inconsistent with the numbers from OpsCenter?

4. Are there any unreasonable things in my test method, such as the test
data or the QPS calculation?

 

Thank you very much,

Joy



Re: Pros and cons of lots of very small partitions versus fewer larger partitions

2014-12-06 Thread Eric Stevens
B would work better in the case where you need to do sequential or ranged
style reads on the id, particularly if id has any significant sparseness
(eg, id is a timeuuid).  You can compute the buckets and do reads of entire
buckets within your range.  However if you're doing random access by id,
then you'll have a lot of bloom filter true positives (on the partition
key), but where the clustering key still doesn't exist.

We use both types of model for differing situations.  In one our reads are
totally random access, and we just use id as the sole key, in the other we
need to reassemble all objects that happen in a range, but the object ID's
are reasonably sparse, so we have time-bound bucket as the partition key
and the id as the clustering key.

The appropriate density of rows in your partition/bucket will depend on
your typical read patterns; aim for some multiple of your typical read
ranges (eg, if you typically query for all objects within a day, the bucket
might be 1 or 2 hours; if you typically query by hour, perhaps the bucket is
10 minutes, etc.).  Practically speaking, depending on your hardware you'll
want to try to keep your partitions under anywhere from a few hundred KB to
a MB if possible, just to reduce GC pressure and improve other operations
like repair.
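
To make the comparison concrete, here is a minimal CQL sketch of the two
shapes being discussed (column names and types are assumed for illustration,
not taken from the original post; bucket is computed client-side as id / N):

-- Table A: one row per partition, pure key/value lookup
CREATE TABLE table_a (
    id    bigint PRIMARY KEY,
    value text
);

-- Table B: N ids grouped into one partition via a computed bucket
CREATE TABLE table_b (
    bucket bigint,
    id     bigint,
    value  text,
    PRIMARY KEY ((bucket), id)
);

-- Ranged read of an entire bucket (works well when ids are sparse):
SELECT * FROM table_b WHERE bucket = 42;
-- Random access by id still requires computing the bucket first:
SELECT * FROM table_b WHERE bucket = 42 AND id = 4242;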

On Fri Dec 05 2014 at 11:04:22 AM DuyHai Doan  wrote:

> Another argument for table A is that it leverages a lot Bloom filter for
> fast lookup. If negative, no disk hit otherwise at most 1 or 2 disk hits
> depending on the fp chance.
>
> Compaction also works better on skinny partition.
>
> On Fri, Dec 5, 2014 at 6:33 PM, Tyler Hobbs  wrote:
>
>>
>> On Fri, Dec 5, 2014 at 11:14 AM, Robert Wille  wrote:
>>
>>>
>>>  And let's say that bucket is computed as id / N. For analysis purposes,
>>> let's assume I have 100 million ids to store.
>>>
>>>  Table a is obviously going to have a larger bloom filter. That’s a
>>> clear negative.
>>>
>>
>> That's true, *but*, if you frequently ask for rows that do not exist,
>> Table B will have few BF false positives, while Table A will almost always
>> get a "hit" from the BF and have to look into the SSTable to see that the
>> requested row doesn't actually exist.
>>
>>
>>>
>>>  When I request a record, table a will have less data to load from
>>> disk, so that seems like a positive.
>>>
>>
>> Correct.
>>
>>
>>>
>>>  Table a will never have its columns scattered across multiple
>>> SSTables, but table b might. If I only want one row from a partition in
>>> table b, does fragmentation matter (I think probably not, but I’m not sure)?
>>>
>>
>> Yes, fragmentation can matter.  Cassandra knows the min and max
>> clustering column values for each SSTable, so it can use those to narrow
>> down the set of SSTables it needs to read if you request a specific
>> clustering column value.  However, in your example, this isn't likely to
>> narrow things down much, so it will have to check many more SSTables.
>>
>>
>>>
>>>  It’s not clear to me which will fit more efficiently on disk, but I
>>> would guess that table a wins.
>>>
>>
>> They're probably close enough not to matter very much.
>>
>>
>>>
>>>  Smaller partitions means sending less data during repair, but I
>>> suspect that when computing the Merkle tree for the table, more partitions
>>> might mean more overhead, but that’s only a guess. Which one repairs more
>>> efficiently?
>>>
>>
>> Table A repairs more efficiently by far.  Currently repair must repair
>> entire partitions when they differ.  It cannot repair individual rows
>> within a partition.
>>
>>
>>>
>>>  In your opinion, which one is best and why? If you think table b is
>>> best, what would you choose N to be?
>>>
>>
>> Table A, hands down.  Here's why: you should model your tables to fit
>> your queries.  If you're doing a basic K/V lookup, model it like table A.
>> People recommend wide partitions because many (if not most) queries are
>> best served by that type of model, so if you're not using wide partitions,
>> it's a sign that something might be wrong.  However, there are certainly
>> plenty of use cases where single-row partitions are fine.
>>
>>
>> --
>> Tyler Hobbs
>> DataStax 
>>
>
>


Re: nodetool repair exception

2014-12-06 Thread Eric Stevens
The official recommendation is 100k:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html

I wonder if there's an advantage to this over unlimited if you're running
servers which are dedicated to your Cassandra cluster (which you should be
for anything production).

On Fri Dec 05 2014 at 2:39:24 PM Robert Coli  wrote:

> On Wed, Dec 3, 2014 at 6:37 AM, Rafał Furmański 
> wrote:
>
>> I see “Too many open files” exception in logs, but I’m sure that my limit
>> is now 150k.
>> Should I increase it? What’s the reasonable limit of open files for
>> cassandra?
>
>
> Why provide any limit? ulimit allows "unlimited"?
>
> =Rob
>
>


Re: How to model data to achieve specific data locality

2014-12-06 Thread Eric Stevens
It depends on the size of your data, but if your data is reasonably small,
there should be no trouble including thousands of records on the same
partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
ought to work fine.

If the data size per partition exceeds some threshold that represents the
right tradeoff of increasing repair cost, gc pressure, threatening
unbalanced loads, and other issues that come with wide partitions, then you
can subpartition via some means in a manner consistent with your work load,
with something like PRIMARY KEY ((seq_id, subpartition), seq_type).

For example, if seq_type can be processed for a given seq_id in any order,
and you need to be able to locate specific records for a known
seq_id/seq_type pair, you can compute subpartition
deterministically.  Or if you only ever need to read *all* values for a
given seq_id, and the processing order is not important, just randomly
generate a value for subpartition at write time, as long as you can know
all possible values for subpartition.

If the values for the seq_types for a given seq_id must always be processed
in order based on seq_type, then your subpartition calculation would need
to reflect that and place adjacent seq_types in the same partition.  As a
contrived example, say seq_type was an incrementing integer, your
subpartition could be seq_type / 100.
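
As a concrete sketch of that last layout (the types and the bucket width of
100 are illustrative assumptions, not taken from the original question):

CREATE TABLE sequences (
    seq_id       text,
    subpartition int,
    seq_type     int,
    data         blob,
    PRIMARY KEY ((seq_id, subpartition), seq_type)
);

-- With subpartition = seq_type / 100 computed by the client, seq_types
-- 0..99 of a given seq_id share one partition, 100..199 the next, and so on:
SELECT * FROM sequences WHERE seq_id = 'abc' AND subpartition = 0;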

On Fri Dec 05 2014 at 7:34:38 PM Kai Wang  wrote:

> I have a data model question. I am trying to figure out how to model the
> data to achieve the best data locality for analytic purpose. Our
> application processes sequences. Each sequence has a unique key in the
> format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
> number of seq_types. The typical read is to load a subset of sequences with
> the same seq_id. Naturally I would like to have all the sequences with the
> same seq_id to co-locate on the same node(s).
>
>
> However I can't simply create one partition per seq_id and use seq_id as
> my partition key. That's because:
>
>
> 1. there could be thousands or even more seq_types for each seq_id. It's
> not feasible to include all the seq_types into one table.
>
> 2. each seq_id might have different sets of seq_types.
>
> 3. each application only needs to access a subset of seq_types for a
> seq_id. Based on CASSANDRA-5762, select partial row loads the whole row. I
> prefer only touching the data that's needed.
>
>
> As per above, I think I should use one partition per [seq_id]_[seq_type].
> But how can I achieve the data locality on seq_id? One possible approach is
> to override IPartitioner so that I just use part of the field (say 64
> bytes) to get the token (for location) while still using the whole field as
> partition key (for look up). But before heading that direction, I would
> like to see if there are better options out there. Maybe any new or
> upcoming features in C* 3.0?
>
>
> Thanks.
>


Re: Keyspace and table/cf limits

2014-12-06 Thread Eric Stevens
Based on recent conversations with Datastax engineers, the recommendation
is definitely still to run a finite and reasonable set of column families.

The best way I know of to support multitenancy is to include tenant id in
all of your partition keys.
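
For example, a minimal sketch (the table and column names here are invented
for illustration):

CREATE TABLE customer_data (
    tenant_id text,
    object_id timeuuid,
    payload   text,
    PRIMARY KEY ((tenant_id, object_id))
);

-- Every query is scoped to a tenant, so one set of tables serves all clients:
-- SELECT payload FROM customer_data WHERE tenant_id = 'acme' AND object_id = <some timeuuid>;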

On Fri Dec 05 2014 at 7:39:47 PM Kai Wang  wrote:

> On Fri, Dec 5, 2014 at 4:32 PM, Robert Coli  wrote:
>
>> On Wed, Dec 3, 2014 at 1:54 PM, Raj N  wrote:
>>
>>> The question is more from a multi-tenancy point of view. We wanted to
>>> see if we can have a keyspace per client. Each keyspace may have 50 column
>>> families, but if we have 200 clients, that would be 10,000 column families.
>>> Do you think that's reasonable to support? I know that key cache capacity
>>> is reserved in heap still. Any plans to move it off-heap?
>>>
>>
>> That's an order of magnitude more CFs than I would want to try to operate.
>>
>> But then, I wouldn't want to operate Cassandra multi-tenant AT ALL, so
>> grain of salt.
>>
>> =Rob
>> http://twitter.com/rcolidba
>>
>>
> I don't know if it's still true, but Jonathan Ellis wrote in an old post
> saying there's a fixed overhead per CF. Here is the link:
> http://dba.stackexchange.com/a/12413. Even if it's improved since C* 1.0,
> I still don't feel comfortable scaling my system by creating CFs.
>
>


Re: nodetool repair exception

2014-12-06 Thread Tim Heckman
On Sat, Dec 6, 2014 at 8:05 AM, Eric Stevens  wrote:
> The official recommendation is 100k:
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
>
> I wonder if there's an advantage to this over unlimited if you're running
> servers which are dedicated to your Cassandra cluster (which you should be
> for anything production).

There is the potential to have monitoring systems, and other small
agents, running on production systems. I could see the limit simply as a
stop-gap to keep Cassandra from starving the rest of the system of free
file descriptors. In theory, if there's no proper watchdog on your
monitoring agents, that starvation could prevent an issue from ever
raising an alert. That's just one potential advantage I could think of,
though.

Cheers!
-Tim

> On Fri Dec 05 2014 at 2:39:24 PM Robert Coli  wrote:
>>
>> On Wed, Dec 3, 2014 at 6:37 AM, Rafał Furmański 
>> wrote:
>>>
>>> I see “Too many open files” exception in logs, but I’m sure that my limit
>>> is now 150k.
>>> Should I increase it? What’s the reasonable limit of open files for
>>> cassandra?
>>
>>
>> Why provide any limit? ulimit allows "unlimited"?
>>
>> =Rob
>>


Re: How to model data to achieve specific data locality

2014-12-06 Thread Kai Wang
On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens  wrote:

> It depends on the size of your data, but if your data is reasonably small,
> there should be no trouble including thousands of records on the same
> partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
> ought to work fine.
>
> If the data size per partition exceeds some threshold that represents the
> right tradeoff of increasing repair cost, gc pressure, threatening
> unbalanced loads, and other issues that come with wide partitions, then you
> can subpartition via some means in a manner consistent with your work load,
> with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>
> For example, if seq_type can be processed for a given seq_id in any order,
> and you need to be able to locate specific records for a known
> seq_id/seq_type pair, you can compute subpartition
> deterministically.  Or if you only ever need to read *all* values for a
> given seq_id, and the processing order is not important, just randomly
> generate a value for subpartition at write time, as long as you can know
> all possible values for subpartition.
>
> If the values for the seq_types for a given seq_id must always be
> processed in order based on seq_type, then your subpartition calculation
> would need to reflect that and place adjacent seq_types in the same
> partition.  As a contrived example, say seq_type was an incrementing
> integer, your subpartition could be seq_type / 100.
>
> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang  wrote:
>
>> I have a data model question. I am trying to figure out how to model the
>> data to achieve the best data locality for analytic purpose. Our
>> application processes sequences. Each sequence has a unique key in the
>> format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
>> number of seq_types. The typical read is to load a subset of sequences with
>> the same seq_id. Naturally I would like to have all the sequences with the
>> same seq_id to co-locate on the same node(s).
>>
>>
>> However I can't simply create one partition per seq_id and use seq_id as
>> my partition key. That's because:
>>
>>
>> 1. there could be thousands or even more seq_types for each seq_id. It's
>> not feasible to include all the seq_types into one table.
>>
>> 2. each seq_id might have different sets of seq_types.
>>
>> 3. each application only needs to access a subset of seq_types for a
>> seq_id. Based on CASSANDRA-5762, select partial row loads the whole row. I
>> prefer only touching the data that's needed.
>>
>>
>> As per above, I think I should use one partition per [seq_id]_[seq_type].
>> But how can I achieve the data locality on seq_id? One possible approach is
>> to override IPartitioner so that I just use part of the field (say 64
>> bytes) to get the token (for location) while still using the whole field as
>> partition key (for look up). But before heading that direction, I would
>> like to see if there are better options out there. Maybe any new or
>> upcoming features in C* 3.0?
>>
>>
>> Thanks.
>>
>
Thanks, Eric.

Those sequences are not fixed. All sequences with the same seq_id tend to
grow at the same rate. If it's one partition per seq_id, the size will most
likely exceed the threshold quickly. Also new seq_types can be added and
old seq_types can be deleted. This means I often need to ALTER TABLE to add
and drop columns. I am not sure if this is a good practice from operation
point of view.

I thought about your subpartition idea. If there are only a few
applications and each one of them uses a subset of seq_types, I can easily
create one table per application since I can compute the subpartition
deterministically as you said. But in my case data scientists need to
easily write new applications using any combination of seq_types of a
seq_id. So I want the data model to be flexible enough to support
applications using any different set of seq_types without creating new
tables, duplicating all the data, etc.

-Kai


Re: Keyspace and table/cf limits

2014-12-06 Thread Kai Wang
On Sat, Dec 6, 2014 at 11:22 AM, Eric Stevens  wrote:

> Based on recent conversations with Datastax engineers, the recommendation
> is definitely still to run a finite and reasonable set of column families.
>
> The best way I know of to support multitenancy is to include tenant id in
> all of your partition keys.
>
> On Fri Dec 05 2014 at 7:39:47 PM Kai Wang  wrote:
>
>> On Fri, Dec 5, 2014 at 4:32 PM, Robert Coli  wrote:
>>
>>> On Wed, Dec 3, 2014 at 1:54 PM, Raj N  wrote:
>>>
 The question is more from a multi-tenancy point of view. We wanted to
 see if we can have a keyspace per client. Each keyspace may have 50 column
 families, but if we have 200 clients, that would be 10,000 column families.
 Do you think that's reasonable to support? I know that key cache capacity
 is reserved in heap still. Any plans to move it off-heap?

>>>
>>> That's an order of magnitude more CFs than I would want to try to
>>> operate.
>>>
>>> But then, I wouldn't want to operate Cassandra multi-tenant AT ALL, so
>>> grain of salt.
>>>
>>> =Rob
>>> http://twitter.com/rcolidba
>>>
>>>
>> I don't know if it's still true but Jonathan Ellis wrote in an old post
>> saying there's a fixed overhead per cf. Here is the link.
>> http://dba.stackexchange.com/a/12413. Even if it's improved since C*
>> 1.0, I still don't feel comfortable to scale my system by creating CFs.
>>
>>
I agree with Eric on encoding the tenant id into the partition key. It seems
the OP wants to use keyspaces to achieve client isolation, but I think
multitenancy is too high-level a feature to be pushed into the database
layer. It's better handled by the application, IMO.


Re: Keyspace and table/cf limits

2014-12-06 Thread Jack Krupansky
Generally, limit a Cassandra cluster to low hundreds of tables, regardless of 
the number of keyspaces. Going beyond low hundreds is certainly an “expert” 
feature and requires great care. Sure, maybe you can have 500 or 750 or maybe 
even 1,000 tables in a cluster, but don’t be surprised if you start running 
into memory and performance issues.

There is an undocumented method to reduce the table overhead to support more 
tables, but... if you are not expert enough to find it on your own, then you 
are definitely not expert enough to be using it.

-- Jack Krupansky

From: Raj N 
Sent: Tuesday, November 25, 2014 12:07 PM
To: user@cassandra.apache.org 
Subject: Keyspace and table/cf limits

What's the latest on the maximum number of keyspaces and/or tables that one can 
have in Cassandra 2.1.x? 

-Raj

Re: Keyspace and table/cf limits

2014-12-06 Thread Jack Krupansky
There are two categorically distinct forms of multi-tenancy: 1) you control the 
apps and simply want client data isolation, and 2) the clients have their own 
apps, access the cluster directly, and rely on access control at the table 
level to isolate their data.

Using a tenant ID in the partition key is the preferred approach and works well 
for the first use case, but it doesn’t provide the strict isolation of data 
needed for the second use case. Still, try to use that first approach if you 
can.

You should also consider an application layer which would intermediate between 
the tenant clients and the cluster, supplying the tenant ID in the partition 
key. That does add an extra hop for data access, but is a cleaner design.

If you really do need to maintain separate tables and keyspaces, use what I 
call “sharded clusters” – multiple, independent clusters with a hash on the 
user/tenant ID to select which cluster to use, but limit each cluster to low 
hundreds of tables. It is worth noting that if each tenant needs to be isolated 
anyway, there is clearly no need to store independent tenants on the same 
cluster.

You will have to do your own proof of concept implementation to determine what 
table limit works best for your use case.

-- Jack Krupansky

From: Raj N 
Sent: Wednesday, December 3, 2014 4:54 PM
To: user@cassandra.apache.org 
Subject: Re: Keyspace and table/cf limits

The question is more from a multi-tenancy point of view. We wanted to see if we 
can have a keyspace per client. Each keyspace may have 50 column families, but 
if we have 200 clients, that would be 10,000 column families. Do you think 
that's reasonable to support? I know that key cache capacity is reserved in 
heap still. Any plans to move it off-heap? 

-Raj

On Tue, Nov 25, 2014 at 3:10 PM, Robert Coli  wrote:

  On Tue, Nov 25, 2014 at 9:07 AM, Raj N  wrote:

What's the latest on the maximum number of keyspaces and/or tables that one 
can have in Cassandra 2.1.x?

  Most relevant changes lately would be :

  https://issues.apache.org/jira/browse/CASSANDRA-6689

  and
  https://issues.apache.org/jira/browse/CASSANDRA-6694


  Which should meaningfully reduce the amount of heap memtables consume. That 
heap can then be used to support more heap-persistent structures associated 
with many CFs. I have no idea how to estimate the scale of the improvement.

  As a general/meta statement, Cassandra is very multi-threaded, and consumes 
file handles like crazy. How many different query cases do you really want to 
put on one cluster/node? ;D

  =Rob



Re: Keyspace and table/cf limits

2014-12-06 Thread Jason Wee
+1 well said Jack!

On Sun, Dec 7, 2014 at 6:13 AM, Jack Krupansky 
wrote:

>   Generally, limit a Cassandra cluster to low hundreds of tables, regardless
> of the number of keyspaces. Going beyond low hundreds is certainly an “expert”
> feature and requires great care. Sure, maybe you can have 500 or 750 or
> maybe even 1,000 tables in a cluster, but don’t be surprised if you start
> running into memory and performance issues.
>
> There is an undocumented method to reduce the table overhead to support
> more tables, but... if you are not expert enough to find it on your own,
> then you are definitely not expert enough to be using it.
>
> -- Jack Krupansky
>
>  *From:* Raj N 
> *Sent:* Tuesday, November 25, 2014 12:07 PM
> *To:* user@cassandra.apache.org
> *Subject:* Keyspace and table/cf limits
>
>  What's the latest on the maximum number of keyspaces and/or tables that
> one can have in Cassandra 2.1.x?
>
> -Raj
>