While Graham's suggestion will let you collapse a bunch of tables into a
single one, it'll likely cause so many other problems that it won't be
worth the effort.  I strongly advise against this approach.

First off, different workloads need different tuning: compaction
strategies, gc_grace_seconds, garbage collection, and so on.  These
settings are very workload specific, and you'll quickly find that fixing
one team's problem negatively impacts another's.
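Concretely, two apps sharing a cluster can legitimately want opposite
per-table settings.  A rough sketch (table names and values here are made
up for illustration):

```cql
-- Write-heavy time-series app: size-tiered compaction, default grace period
ALTER TABLE metrics.events
  WITH compaction = { 'class': 'SizeTieredCompactionStrategy' }
  AND gc_grace_seconds = 864000;  -- 10 days, the default

-- Read-heavy, overwrite-heavy app: leveled compaction, short grace period
-- (a short gc_grace_seconds assumes you run repair at least that often)
ALTER TABLE profiles.users
  WITH compaction = { 'class': 'LeveledCompactionStrategy' }
  AND gc_grace_seconds = 86400;
```

Those settings are per-table, but heap sizing, GC flags, and repair
schedules are per-node, so a shared cluster forces a single compromise on
everyone.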

Nested JSON stored in maps will not lead to a good data model from a
performance perspective, and it will limit your flexibility.  As CQL
becomes more expressive you'll miss out on its querying potential, as well
as the ability to *easily* query those tables from tools like Spark.
You'll also hit the limit on the number of elements in a map, which to my
knowledge still exists in current C* versions.
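For reference, the map-based model being proposed looks roughly like this
(table and column names are made up):

```cql
-- One generic table replacing many per-entity tables
CREATE TABLE shared.objects (
    tenant_id text,
    object_id text,
    fields    map<text, text>,  -- top-level field name -> JSON literal
    PRIMARY KEY ((tenant_id, object_id))
);

-- Each top-level field is independently updatable...
UPDATE shared.objects
   SET fields['address'] = '{"city":"Tokyo","zip":"100-0001"}'
 WHERE tenant_id = 't42' AND object_id = 'user-1001';

-- ...but CQL can't filter or index on the JSON payloads, so this is
-- effectively a key-value store with all querying pushed to the client.
```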

If you're truly dealing with a lot of data, you'll end up managing a single
cluster of thousands of nodes.  Managing clusters of more than 1,000 nodes
is territory that only a handful of people in the world are familiar with.
Even the folks at Netflix stick to a couple hundred.

Managing multi-tenancy for a hundred clients, each with different version
requirements, will be a nightmare from a people perspective.  You'll need
everyone to be in sync when you upgrade your cluster, and people are, in
general, pretty bad at that kind of coordination.  Getting a hundred
applications upgraded in lockstep (say, to use a newer driver version) is
pretty much impossible.

"off heap" in 2.1 isn't fully off heap.  Read
http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1 for
details.

If you hit any performance issues (GC pauses, etc.), you will take down
your entire business instead of just a small portion, because everyone
impacts everyone else.  One app's tombstones will cause compaction problems
for everyone using that table, and that will be a disaster to try to fix.

A side note: you can get away with more than 8GB of heap if you use G1GC.
In fact, G1 only really works well with heaps larger than 8GB.  With ParNew
& CMS, tuning the JVM is a different story.  The following two pages are a
good read if you're interested in such details.

https://issues.apache.org/jira/browse/CASSANDRA-8150
http://blakeeggleston.com/cassandra-tuning-the-jvm-for-read-heavy-workloads.html

My recommendation: separate your concerns.  Put each application (or a
small handful of them) on its own cluster and maintain multiple clusters.
Give each application its own keyspace and model normally.  If you need to
move an app off to its own cluster, do so by setting up a second DC for
that keyspace, replicating, and then shifting over.
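Concretely, that migration path looks something like this (keyspace and DC
names are made up, and you'd still run rebuild/repair between the steps):

```cql
-- Each app gets its own keyspace, modeled normally
CREATE KEYSPACE app_invoicing
  WITH replication = { 'class': 'NetworkTopologyStrategy', 'DC1': 3 };

-- To move the app off: add a second DC for that keyspace...
ALTER KEYSPACE app_invoicing
  WITH replication = { 'class': 'NetworkTopologyStrategy',
                       'DC1': 3, 'DC2': 3 };

-- ...stream the data (nodetool rebuild DC1 on the new nodes), repair,
-- point the app at DC2, then drop the old DC from the replication map:
ALTER KEYSPACE app_invoicing
  WITH replication = { 'class': 'NetworkTopologyStrategy', 'DC2': 3 };
```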

Jon


On Thu, May 28, 2015 at 3:06 AM Graham Sanderson <gra...@vast.com> wrote:

> Depending on your use case and data types (for example if you can have a
> minimally nested JSON representation of the objects), you could go with a
> common map<string,string> representation where keys are top-level object
> fields and values are valid JSON literals as strings; e.g. unquoted
> primitives, quoted strings, unquoted arrays or other objects.
>
> Each top-level field is then independently updatable - which may be
> beneficial (and allows you to trivially keep historical versions of objects
> if that is a requirement)
>
> If you are updating the object in its entirety on save, then simply store
> the entire object in a single CQL column, and denormalize any search fields
> you may need (which you kind of want to do anyway)
>
> Sent from my iPhone
>
> On May 28, 2015, at 1:49 AM, Arun Chaitanya <chaitan64a...@gmail.com>
> wrote:
>
> Hello Jack,
>
> > Column families? As opposed to tables? Are you using Thrift instead of
> CQL3? You should be focusing on the latter, not the former.
> We have an ORM developed in our company, which maps each DTO to a column
> family. So, we have many column families. We are using CQL3.
>
> > But either way, the general guidance is that there is no absolute limit
> on tables per se, but "low hundreds" is the recommended limit, regardless
> of how many keyspaces they may be divided between. More than that is an
> anti-pattern for Cassandra - maybe you can
> make it work for your application, but it isn't recommended.
> Are you saying that most Cassandra users don't have more than 200-300
> column families? Is this achieved through careful data modelling?
>
> > A successful Cassandra deployment is critically dependent on careful
> data modeling - who is responsible for modeling each of these tables, you
> and a single, tightly-knit team with very common interests > and very
> specific goals and SLAs or many different developers with different
> managers with different goals such as SLAs?
> The latter.
>
> > When you say multi-tenant, are you simply saying that each of your
> organization's customers has their data segregated, or does each customer
> have direct access to the cluster?
> Each customer's data is segregated within the same cluster. No customer
> has direct access to the cluster.
>
> Thanks,
> Arun
>
> On Wed, May 27, 2015 at 7:17 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> Scalability of Cassandra refers primarily to number of rows and number of
>> nodes - to add more data, add more nodes.
>>
>> Column families? As opposed to tables? Are you using Thrift instead of
>> CQL3? You should be focusing on the latter, not the former.
>>
>> But either way, the general guidance is that there is no absolute limit
>> on tables per se, but "low hundreds" is the recommended limit, regardless
>> of how many keyspaces they may be divided between. More than that is an
>> anti-pattern for Cassandra - maybe you can make it work for your
>> application, but it isn't recommended.
>>
>> A successful Cassandra deployment is critically dependent on careful data
>> modeling - who is responsible for modeling each of these tables, you and a
>> single, tightly-knit team with very common interests and very specific
>> goals and SLAs or many different developers with different managers with
>> different goals such as SLAs?
>>
>> When you say multi-tenant, are you simply saying that each of your
>> organization's customers has their data segregated, or does each customer
>> have direct access to the cluster?
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, May 26, 2015 at 11:32 PM, Arun Chaitanya <chaitan64a...@gmail.com
>> > wrote:
>>
>>> Good Day Everyone,
>>>
>>> I am very happy with the (almost) linear scalability offered by C*. We
>>> had a lot of problems with RDBMS.
>>>
>>> But, I heard that C* has a limit on number of column families that can
>>> be created in a single cluster.
>>> The reason being each CF stores 1-2 MB on the JVM heap.
>>>
>>> In our use case, we have about 10000+ CF and we want to support
>>> multi-tenancy.
>>> (i.e 10000 * no of tenants)
>>>
>>> We are new to C* and, coming from an RDBMS background, would like your
>>> advice on how to tackle this scenario.
>>>
>>> Our plan is to use Off-Heap memtable approach.
>>> http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1
>>>
>>> Each node in the cluster has following configuration
>>> 16 GB machine (8GB Cassandra JVM + 2GB System + 6GB Off-Heap)
>>> IMO, this should be able to support 1000 CF with little to no impact on
>>> performance and startup time.
>>>
>>> We tackle multi-tenancy using different keyspaces. (A solution I found
>>> on the web.)
>>>
>>> Using this approach, we can have 10 clusters doing the job. (We are
>>> actually worried about the cost.)
>>>
>>> Can you please help us evaluate this strategy? I want to hear the
>>> community's opinion on this.
>>>
>>> My major concerns being,
>>>
>>> 1. Is the off-heap strategy safe, and is my assumption of 16 GB
>>> supporting 1000 CF correct?
>>>
>>> 2. Can we use multiple keyspaces to solve multi-tenancy? IMO, the number
>>> of column families increases even when we use multiple keyspaces.
>>>
>>> 3. I understand the complexity of using multiple clusters for a single
>>> application. The code base will get tightly coupled with the
>>> infrastructure. Is this the right approach?
>>>
>>> Any suggestion is appreciated.
>>>
>>> Thanks,
>>> Arun
>>>
>>
>>
>
