It's definitely not true for every use case involving a large number of tables, but for many cases where you'd be tempted to go that route, taking whatever would have driven your table naming and modeling it instead as a column in the partition key of a smaller number of tables will meet your needs. This is especially true if you're trying to solve multi-tenancy, unless you let your tenants dynamically drive your schema (which is a separate can of worms).
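For multi-tenancy that usually means something like the following (a rough sketch in CQL; the table and column names here are made up purely for illustration):

    -- One shared table instead of one table per tenant; the tenant id that
    -- would have gone into the table name becomes the partition key.
    CREATE TABLE events (
        tenant_id text,
        event_id  timeuuid,
        payload   text,
        PRIMARY KEY ((tenant_id), event_id)
    );

    -- Each tenant's rows live in their own partitions, so per-tenant
    -- reads stay targeted:
    SELECT * FROM events WHERE tenant_id = 'acme';

Each tenant's data is still isolated at the partition level, and you get that isolation without the per-table overhead.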
On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky <jack.krupan...@gmail.com> wrote:

> I don't think Cassandra was "purposefully developed" for some target number of tables - there is no evidence of any such explicit intent. Instead, it would be fair to say that Cassandra was "not purposefully developed" with a goal of supporting "large numbers of tables." Sometimes features and capabilities come for free or as a side effect of the technologies used, but usually specific features and specific capabilities (such as large numbers of tables) require explicit intent and explicit effort.
>
> One could indeed endeavor to design a data store (I'm not even sure it would still be considered a database per se) that supported either large numbers of tables or an additional level of storage model in between table and row (call it "group" maybe, or "sub-table"). But obviously Cassandra was not designed with that goal in mind.
>
> Traditionally, a "table" is a defined relation over a set of data. Relation and data are distinct concepts, and a relation name is not simply a Java-style "object". A relation (table) name is supposed to represent an abstraction or entity type, while essentially all of the cases I have heard of for wanting thousands (or even hundreds) of tables are trying to use a table as a container for a group of rows for a specific entity instance rather than for a distinct entity type. Granted, Cassandra is not obligated to be limited to the relational model, but Cassandra, and especially CQL, is intentionally modeled reasonably closely on the relational model in terms of its data modeling abstractions, even though the storage engine is designed to scale across nodes.
>
> You could file a Jira requesting such a feature improvement, and then we would see if sentiment has shifted over the years.
>
> The key thing is to offer up a use case that warrants support for large numbers of tables. So far, it has usually been the case that the perceived need for separate tables could easily be met using clustering columns of a single table.
>
> Seriously, if you guys can define a legitimate use case that can't easily be handled by a single table, that could get the discussion started.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:
>
>> Hi Jack
>>
>> Being purposefully developed to only handle up to “a few hundred” tables is reason enough. I accept that, and likely a use case with many tables was never really considered. But I would still like to understand the design choices that were made, so we can gain some confidence in this upper limit on the number of tables. The best estimate we have so far is “a few hundred”, which is a bit vague.
>>
>> Regarding scaling, I’m not talking about scaling in terms of data volume, but in terms of how the data is structured. One thousand tables with one row each hold the same data volume as one table with one thousand rows, excluding any data structures required to maintain the extra tables. Yet whereas the first seems likely to bring a Cassandra cluster to its knees, the second will run happily on a single-node cluster on a low-end machine.
>>
>> We will design our code to use a single table to avoid having nightmares with this issue. But if there is any authoritative documentation on this characteristic of Cassandra, I would love to know more.
>>
>> FJ
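To make the single-table pattern Jack mentions above concrete, here is a sketch in CQL (names invented for illustration): rows that would have gone into separate tables are distinguished by a clustering column instead.

    CREATE TABLE readings (
        device_id text,
        metric    text,       -- would otherwise have driven a table per metric
        ts        timestamp,
        value     double,
        PRIMARY KEY ((device_id), metric, ts)
    );

    -- Rows for each "would-be table" sit contiguously within the partition,
    -- sorted by the clustering columns:
    SELECT * FROM readings WHERE device_id = 'd-1' AND metric = 'temperature';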
>> On 01 Mar 2016, at 14:23, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>
>> I don't think there are any "reasons behind it." It is simply empirical experience - as reported here.
>>
>> Cassandra scales in two dimensions - number of rows per node and number of nodes. If some source of information led you to believe otherwise, please point out the source so that we can endeavor to correct it.
>>
>> The exact number of rows per node and tables per node will always have to be evaluated empirically - with a proof-of-concept implementation - since it all depends on the capabilities of your hardware combined with your specific data model, your specific data values, your specific access patterns, and your specific load. It also depends on your own personal tolerance for degradation of latency and throughput - some people might find a given set of performance metrics acceptable while others might not.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:
>>
>>> Hi Tommaso
>>>
>>> It’s not that I _need_ a large number of tables. That approach maps easily to the problem we are trying to solve, but it’s becoming clear it’s not the right one.
>>>
>>> At the moment I’m trying to understand the limitations in Cassandra regarding the number of tables, and the reasons behind them. I’ve come to the mailing list as my Google-fu is not giving me what I’m looking for :(
>>>
>>> FJ
>>>
>>> On 01 Mar 2016, at 09:36, tommaso barbugli <tbarbu...@gmail.com> wrote:
>>>
>>> Hi Fernando,
>>>
>>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, and it was a real pain in terms of operations. Repairs were terribly slow, startup of C* slowed down, and in general tracking table metrics became a bit more work. Why do you need such a high number of tables?
>>>
>>> Tommaso
>>>
>>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:
>>>
>>>> Hi Jack
>>>>
>>>> By entry I mean row.
>>>>
>>>> Apologies for the “obsolete terminology”. When I first looked at Cassandra it was still on CQL2, and now that I’m looking at it again I’ve defaulted to the terms I already knew. I will bear it in mind and call them tables from now on.
>>>>
>>>> Is there any documentation about this limit? For example, I’d be keen to know how much memory is consumed per table, and I’m also curious about the reasons for keeping this in memory. I’m trying to understand the limitations here, rather than challenge them.
>>>>
>>>> So far I have found nothing in my search, hence the “load testing” to see what happens when you push the table count high.
>>>>
>>>> Thanks
>>>> FJ
>>>>
>>>> On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>
>>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>>>
>>>> You are using the obsolete terminology of CQL2 and Thrift - column family. With CQL3 you should be creating "tables". The practical recommendation of an upper limit of a few hundred tables across all keyspaces remains.
>>>>
>>>> Technically you can go higher, and technically you can reduce the overhead per table (via an undocumented Jira - intentionally undocumented, since it is strongly not recommended), but it is unlikely that you will be happy with the results.
>>>>
>>>> What is the nature of the use case?
>>>>
>>>> You basically have two choices: an additional clustering column to distinguish categories of table, or a separate cluster for each few hundred tables.
>>>>
>>>> -- Jack Krupansky
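Jack's first option might look like one of the following in CQL (again, names invented for illustration); where the distinguishing column goes depends on how large each category can grow:

    -- Category as a clustering column: all categories for one logical key
    -- share a partition and can be read together.
    CREATE TABLE items_by_key (
        pk       text,
        category text,
        item_id  timeuuid,
        data     text,
        PRIMARY KEY ((pk), category, item_id)
    );

    -- Category in the partition key: each category scales independently
    -- across nodes, at the cost of a separate read per category.
    CREATE TABLE items_by_category (
        category text,
        item_id  timeuuid,
        data     text,
        PRIMARY KEY ((category), item_id)
    );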
>>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:
>>>>
>>>>> Hi all
>>>>>
>>>>> I have a use case for Cassandra that would require creating a large number of column families. I have found references to early versions of Cassandra where each column family would require a fixed amount of memory on all nodes, effectively imposing an upper limit on the total number of CFs. I have also seen rumblings that this may have been fixed in later versions.
>>>>>
>>>>> To put the question to rest, I have set up a DSE sandbox and created some code to generate column families populated with 3,000 entries each.
>>>>>
>>>>> Unfortunately I have now hit this issue: https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>>>
>>>>> So I will have to retest against Cassandra 3.0 instead.
>>>>>
>>>>> However, I would like to understand the limitations regarding the creation of column families:
>>>>>
>>>>> * Is there a practical upper limit?
>>>>> * Is this a fixed limit, or does it scale as more nodes are added to the cluster?
>>>>> * Is there a difference between one keyspace with thousands of column families, vs thousands of keyspaces with only a few column families each?
>>>>>
>>>>> I haven’t found any hard evidence/documentation to help me here, but if you can point me in the right direction, I will oblige and RTFM away.
>>>>>
>>>>> Many thanks for your help!
>>>>>
>>>>> Cheers
>>>>> FJ