It's definitely not true for every use case involving a large number of tables, but for many cases where you'd be tempted to go that route, taking whatever would have driven your table naming and modeling it instead as a column in the partition key of a smaller number of tables will meet your needs. This is especially true if you're trying to solve multi-tenancy, unless you let your tenants dynamically drive your schema (which is a separate can of worms).
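For multi-tenancy that usually means something like the following (a rough sketch in CQL; the table and column names here are made up purely for illustration):

    -- One shared table instead of one table per tenant; the tenant id that
    -- would have gone into the table name becomes the partition key.
    CREATE TABLE events (
        tenant_id text,
        event_id  timeuuid,
        payload   text,
        PRIMARY KEY ((tenant_id), event_id)
    );

    -- Each tenant's rows live in their own partitions, so per-tenant
    -- reads stay targeted:
    SELECT * FROM events WHERE tenant_id = 'acme';

Each tenant's data is still isolated at the partition level, and you get that isolation without the per-table overhead.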
On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky <jack.krupan...@gmail.com> wrote:

> I don't think Cassandra was "purposefully developed" for some target number of tables - there is no evidence of any such explicit intent. Instead, it would be fair to say that Cassandra was "not purposefully developed" with a goal of supporting "large numbers of tables." Sometimes features and capabilities come for free or as a side effect of the technologies used, but usually specific features and specific capabilities (such as large numbers of tables) require explicit intent and explicit effort.
>
> One could indeed endeavor to design a data store (I'm not even sure it would still be considered a database per se) that supported either large numbers of tables or an additional level of storage model in between table and row (call it "group" maybe, or "sub-table"). But obviously Cassandra was not designed with that goal in mind.
>
> Traditionally, a "table" is a defined relation over a set of data. Relation and data are distinct concepts, and a relation name is not simply a Java-style "object". A relation (table) name is supposed to represent an abstraction or entity type, while essentially all of the cases I have heard of for wanting thousands (or even hundreds) of tables are trying to use a table as a container for a group of rows for a specific entity instance rather than for a distinct entity type. Granted, Cassandra is not obligated to be limited to the relational model, but Cassandra, and especially CQL, is intentionally modeled reasonably closely on the relational model in terms of its data modeling abstractions, even though the storage engine is designed to scale across nodes.
>
> You could file a Jira requesting such a feature improvement, and then we would see if sentiment has shifted over the years.
>
> The key thing is to offer up a use case that warrants support for large numbers of tables. So far, it has usually been the case that the perceived need for separate tables could easily be met using clustering columns of a single table.
>
> Seriously, if you guys can define a legitimate use case that can't easily be handled by a single table, that could get the discussion started.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:
>
>> Hi Jack
>>
>> Being purposefully developed to only handle up to “a few hundred” tables is reason enough. I accept that, and likely a use case with many tables was never really considered. But I would still like to understand the design choices that were made, so we can gain some confidence in this upper limit on the number of tables. The best estimate we have so far is “a few hundred”, which is a bit vague.
>>
>> Regarding scaling, I’m not talking about scaling in terms of data volume, but in terms of how the data is structured. One thousand tables with one row each hold the same data volume as one table with one thousand rows, excluding any data structures required to maintain the extra tables. Yet whereas the first seems likely to bring a Cassandra cluster to its knees, the second will run happily on a single-node cluster on a low-end machine.
>>
>> We will design our code to use a single table to avoid having nightmares with this issue. But if there is any authoritative documentation on this characteristic of Cassandra, I would love to know more.
>>
>> FJ
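To make the single-table pattern Jack mentions above concrete, here is a sketch in CQL (names invented for illustration): rows that would have gone into separate tables are distinguished by a clustering column instead.

    CREATE TABLE readings (
        device_id text,
        metric    text,       -- would otherwise have driven a table per metric
        ts        timestamp,
        value     double,
        PRIMARY KEY ((device_id), metric, ts)
    );

    -- Rows for each "would-be table" sit contiguously within the partition,
    -- sorted by the clustering columns:
    SELECT * FROM readings WHERE device_id = 'd-1' AND metric = 'temperature';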
>> On 01 Mar 2016, at 14:23, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>
>> I don't think there are any "reasons behind it." It is simply empirical experience - as reported here.
>>
>> Cassandra scales in two dimensions - number of rows per node and number of nodes. If some source of information led you to believe otherwise, please point out the source so that we can endeavor to correct it.
>>
>> The exact number of rows per node and tables per node will always have to be evaluated empirically - with a proof-of-concept implementation - since it all depends on the capabilities of your hardware combined with your specific data model, your specific data values, your specific access patterns, and your specific load. It also depends on your own personal tolerance for degradation of latency and throughput - some people might find a given set of performance metrics acceptable while others might not.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:
>>
>>> Hi Tommaso
>>>
>>> It’s not that I _need_ a large number of tables. That approach maps easily to the problem we are trying to solve, but it’s becoming clear it’s not the right one.
>>>
>>> At the moment I’m trying to understand the limitations in Cassandra regarding the number of tables, and the reasons behind them. I’ve come to the mailing list as my Google-fu is not giving me what I’m looking for :(
>>>
>>> FJ
>>>
>>> On 01 Mar 2016, at 09:36, tommaso barbugli <tbarbu...@gmail.com> wrote:
>>>
>>> Hi Fernando,
>>>
>>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, and it was a real pain in terms of operations. Repairs were terribly slow, startup of C* slowed down, and in general tracking table metrics became a bit more work. Why do you need such a high number of tables?
>>>
>>> Tommaso
>>>
>>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:
>>>
>>>> Hi Jack
>>>>
>>>> By entry I mean row.
>>>>
>>>> Apologies for the “obsolete terminology”. When I first looked at Cassandra it was still on CQL2, and now that I’m looking at it again I’ve defaulted to the terms I already knew. I will bear it in mind and call them tables from now on.
>>>>
>>>> Is there any documentation about this limit? For example, I’d be keen to know how much memory is consumed per table, and I’m also curious about the reasons for keeping this in memory. I’m trying to understand the limitations here, rather than challenge them.
>>>>
>>>> So far I have found nothing in my search, hence the “load testing” to see what happens when you push the table count high.
>>>>
>>>> Thanks
>>>> FJ
>>>>
>>>> On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>
>>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>>>
>>>> You are using the obsolete terminology of CQL2 and Thrift - column family. With CQL3 you should be creating "tables". The practical recommendation of an upper limit of a few hundred tables across all keyspaces remains.
>>>>
>>>> Technically you can go higher, and technically you can reduce the overhead per table (via an undocumented Jira - intentionally undocumented, since it is strongly not recommended), but it is unlikely that you will be happy with the results.
>>>>
>>>> What is the nature of the use case?
>>>>
>>>> You basically have two choices: an additional clustering column to distinguish categories of table, or a separate cluster for each few hundred tables.
>>>>
>>>> -- Jack Krupansky
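Jack's first option might look like one of the following in CQL (again, names invented for illustration); where the distinguishing column goes depends on how large each category can grow:

    -- Category as a clustering column: all categories for one logical key
    -- share a partition and can be read together.
    CREATE TABLE items_by_key (
        pk       text,
        category text,
        item_id  timeuuid,
        data     text,
        PRIMARY KEY ((pk), category, item_id)
    );

    -- Category in the partition key: each category scales independently
    -- across nodes, at the cost of a separate read per category.
    CREATE TABLE items_by_category (
        category text,
        item_id  timeuuid,
        data     text,
        PRIMARY KEY ((category), item_id)
    );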
>>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:
>>>>
>>>>> Hi all
>>>>>
>>>>> I have a use case for Cassandra that would require creating a large number of column families. I have found references to early versions of Cassandra where each column family would require a fixed amount of memory on all nodes, effectively imposing an upper limit on the total number of CFs. I have also seen rumblings that this may have been fixed in later versions.
>>>>>
>>>>> To put the question to rest, I have set up a DSE sandbox and created some code to generate column families populated with 3,000 entries each.
>>>>>
>>>>> Unfortunately I have now hit this issue: https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>>>
>>>>> So I will have to retest against Cassandra 3.0 instead.
>>>>>
>>>>> However, I would like to understand the limitations regarding the creation of column families:
>>>>>
>>>>> * Is there a practical upper limit?
>>>>> * Is this a fixed limit, or does it scale as more nodes are added to the cluster?
>>>>> * Is there a difference between one keyspace with thousands of column families, vs thousands of keyspaces with only a few column families each?
>>>>>
>>>>> I haven’t found any hard evidence/documentation to help me here, but if you can point me in the right direction, I will oblige and RTFM away.
>>>>>
>>>>> Many thanks for your help!
>>>>>
>>>>> Cheers
>>>>> FJ