While Graham's suggestion will let you collapse a bunch of tables into a single one, it will likely cause so many other problems that it won't be worth the effort. I strongly advise against this approach.
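Purely for illustration, here is a minimal CQL sketch of the kind of collapsed, map-based table being discussed (the table and column names are hypothetical, following the map<string,string> layout Graham describes below):

    -- A single generic table replacing many per-DTO tables (hypothetical names).
    -- Map keys are top-level object fields; values are JSON literals as strings.
    CREATE TABLE objects (
        tenant_id   text,
        object_type text,
        object_id   uuid,
        fields      map<text, text>,
        PRIMARY KEY ((tenant_id, object_type), object_id)
    );

    -- Each top-level field is independently updatable:
    UPDATE objects
    SET fields['address'] = '{"city": "Tokyo", "zip": "100-0001"}'
    WHERE tenant_id = 'tenant1'
      AND object_type = 'customer'
      AND object_id = 123e4567-e89b-12d3-a456-426655440000;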
First off, different workloads need different tuning: compaction strategies, gc_grace_seconds, garbage collection, and so on. This is very workload specific, and you'll quickly find that fixing one person's problem negatively impacts someone else.

Nested JSON stored in maps will not lead to a good data model from a performance perspective, and it will limit your flexibility. As CQL becomes more expressive, you'll miss out on its querying potential as well as the ability to *easily* query those tables from tools like Spark. You'll also hit the limit on the number of elements in a map, which, to my knowledge, still exists in current C* versions.

If you're truly dealing with a lot of data, you'll be managing one cluster of thousands of nodes. Managing clusters of more than 1,000 nodes is territory that only a handful of people in the world are familiar with. Even the folks at Netflix stick to a couple hundred.

Managing multi-tenancy for a hundred clients, each with different version requirements, will be a nightmare from a people perspective. You'll need everyone to be in sync when you upgrade your cluster. This is just a mess; people are, in general, pretty bad at this type of coordination. Coordinating a hundred application upgrades (say, to use a newer driver version) is pretty much impossible.

"Off heap" in 2.1 isn't fully off heap. Read http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1 for details.

If you hit any performance issues (GC pauses, etc.), you will take down your entire business instead of just a small portion. Everyone will be impacting everyone else. One app's tombstones will cause compaction problems for everyone using that table and will be a disaster to try to fix.

A side note: you can get away with more than 8 GB of heap if you use G1GC; in fact, G1 only really works well above 8 GB. With ParNew & CMS, tuning the JVM is a different story. The following two pages are a good read if you're interested in such details:

https://issues.apache.org/jira/browse/CASSANDRA-8150
http://blakeeggleston.com/cassandra-tuning-the-jvm-for-read-heavy-workloads.html

My recommendation: separate your concerns. Put each application (or a small handful of them) on its own cluster and maintain multiple clusters. Put each application in a different keyspace and model normally. If you need to move an app off onto its own cluster, do so by setting up a second DC for that keyspace, replicating, and then shifting traffic over.
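A minimal sketch of that keyspace-per-app layout and the DC-based migration, with hypothetical keyspace and data center names (an illustration of the approach, not a complete runbook):

    -- One keyspace per application (hypothetical names), modeled normally.
    CREATE KEYSPACE app_orders
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

    -- To move the app onto its own cluster later: add a second DC to the
    -- keyspace's replication settings...
    ALTER KEYSPACE app_orders
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};

    -- ...stream the existing data to the new DC (run on each new node):
    --   nodetool rebuild dc1
    -- then point the app's clients at dc2 and remove dc1 from the keyspace.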
Jon

On Thu, May 28, 2015 at 3:06 AM Graham Sanderson <gra...@vast.com> wrote:

> Depending on your use case and data types (for example, if you can have a
> minimally nested JSON representation of the objects), you could go with a
> common map<string,string> representation where keys are top-level object
> fields and values are valid JSON literals as strings, e.g. unquoted
> primitives, quoted strings, unquoted arrays or other objects.
>
> Each top-level field is then independently updatable - which may be
> beneficial (and allows you to trivially keep historical versions of objects
> if that is a requirement).
>
> If you are updating the object in its entirety on save, then simply store
> the entire object in a single CQL field, and denormalize any search fields
> you may need (which you kind of want to do anyway).
>
> Sent from my iPhone
>
> On May 28, 2015, at 1:49 AM, Arun Chaitanya <chaitan64a...@gmail.com>
> wrote:
>
> Hello Jack,
>
> > Column families? As opposed to tables? Are you using Thrift instead of
> > CQL3? You should be focusing on the latter, not the former.
> We have an ORM developed in our company, which maps each DTO to a column
> family. So, we have many column families. We are using CQL3.
>
> > But either way, the general guidance is that there is no absolute limit
> > of tables per se, but "low hundreds" is the recommended limit, regardless
> > of how many keyspaces they may be divided between. More than that is an
> > anti-pattern for Cassandra - maybe you can make it work for your
> > application, but it isn't recommended.
> You want to say that most Cassandra users don't have more than 200-300
> column families? Is this achieved through careful data modelling?
>
> > A successful Cassandra deployment is critically dependent on careful
> > data modeling - who is responsible for modeling each of these tables:
> > you and a single, tightly-knit team with very common interests and very
> > specific goals and SLAs, or many different developers with different
> > managers with different goals and SLAs?
> The latter.
>
> > When you say multi-tenant, are you simply saying that each of your
> > organization's customers has their data segregated, or does each customer
> > have direct access to the cluster?
> Each organization's data is in the same cluster. No customer has direct
> access to the cluster.
>
> Thanks,
> Arun
>
> On Wed, May 27, 2015 at 7:17 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> Scalability of Cassandra refers primarily to number of rows and number of
>> nodes - to add more data, add more nodes.
>>
>> Column families? As opposed to tables? Are you using Thrift instead of
>> CQL3? You should be focusing on the latter, not the former.
>>
>> But either way, the general guidance is that there is no absolute limit
>> of tables per se, but "low hundreds" is the recommended limit, regardless
>> of how many keyspaces they may be divided between. More than that is an
>> anti-pattern for Cassandra - maybe you can make it work for your
>> application, but it isn't recommended.
>>
>> A successful Cassandra deployment is critically dependent on careful data
>> modeling - who is responsible for modeling each of these tables: you and a
>> single, tightly-knit team with very common interests and very specific
>> goals and SLAs, or many different developers with different managers with
>> different goals and SLAs?
>>
>> When you say multi-tenant, are you simply saying that each of your
>> organization's customers has their data segregated, or does each customer
>> have direct access to the cluster?
>>
>> -- Jack Krupansky
>>
>> On Tue, May 26, 2015 at 11:32 PM, Arun Chaitanya <chaitan64a...@gmail.com>
>> wrote:
>>
>>> Good Day Everyone,
>>>
>>> I am very happy with the (almost) linear scalability offered by C*. We
>>> had a lot of problems with RDBMS.
>>>
>>> But I heard that C* has a limit on the number of column families that
>>> can be created in a single cluster, the reason being that each CF stores
>>> 1-2 MB on the JVM heap.
>>>
>>> In our use case, we have about 10000+ CFs, and we want to support
>>> multi-tenancy (i.e. 10000 * number of tenants).
>>>
>>> We are new to C*, and being from an RDBMS background, I would like to
>>> understand how to tackle this scenario based on your advice.
>>>
>>> Our plan is to use the off-heap memtable approach:
>>> http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1
>>>
>>> Each node in the cluster has the following configuration:
>>> a 16 GB machine (8 GB Cassandra JVM + 2 GB system + 6 GB off-heap).
>>> IMO, this should be able to support 1000 CFs with little to no impact
>>> on performance and startup time.
>>>
>>> We tackle multi-tenancy using different keyspaces (a solution I found
>>> on the web).
>>>
>>> Using this approach, we can have 10 clusters doing the job. (We are
>>> actually worried about the cost.)
>>>
>>> Can you please help us evaluate this strategy? I want to hear the
>>> community's opinion on this.
>>>
>>> My major concerns are:
>>>
>>> 1. Is the off-heap strategy safe, and is my assumption that 16 GB can
>>> support 1000 CFs right?
>>>
>>> 2. Can we use multiple keyspaces to solve multi-tenancy? IMO, the number
>>> of column families increases even when we use multiple keyspaces.
>>>
>>> 3. I understand the complexity of using multiple clusters for a single
>>> application: the code base will get tightly coupled with the
>>> infrastructure. Is this the right approach?
>>>
>>> Any suggestion is appreciated.
>>>
>>> Thanks,
>>> Arun
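Regarding question 2 above: a minimal sketch of the keyspace-per-tenant idea, with hypothetical tenant and table names. Note that each keyspace carries its own copy of every table, so the per-table heap overhead multiplies with the number of tenants, which is exactly the concern raised:

    -- One keyspace per tenant (hypothetical names).
    CREATE KEYSPACE tenant_a
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
    CREATE KEYSPACE tenant_b
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

    -- The same schema is created once per tenant keyspace, so 10000 CFs
    -- per tenant becomes 10000 * number_of_tenants tables cluster-wide,
    -- each holding its ~1-2 MB of JVM heap.
    CREATE TABLE tenant_a.customers (id uuid PRIMARY KEY, name text);
    CREATE TABLE tenant_b.customers (id uuid PRIMARY KEY, name text);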