It is the total table count, across all keyspaces. Memory is memory.

-- Jack Krupansky
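For anyone wanting to check that total on a running cluster, a minimal sketch against the schema tables (Cassandra 3.x; on 2.x the same information lives in system.schema_columnfamilies):

    -- List every table across all keyspaces (Cassandra 3.x):
    SELECT keyspace_name, table_name FROM system_schema.tables;

    -- Cassandra 2.x equivalent:
    -- SELECT keyspace_name, columnfamily_name FROM system.schema_columnfamilies;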
On Tue, Mar 1, 2016 at 6:26 PM, Brian Sam-Bodden <bsbod...@integrallis.com> wrote:

Eric,

Is the keyspace as a multitenancy solution as bad as the many-tables pattern? Is the memory overhead of keyspaces as heavy as that of tables?

Cheers,
Brian
http://www.integrallis.com

On Tuesday, March 1, 2016, Eric Stevens <migh...@gmail.com> wrote:

It's definitely not true for every use case involving a large number of tables, but for many uses where you'd be tempted to do that, adding whatever would have driven your table naming as a column in the partition key of a smaller number of tables will meet your needs. This is especially true if you're looking to solve multi-tenancy, unless you let your tenants dynamically drive your schema (which is a separate can of worms).

On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky <jack.krupan...@gmail.com> wrote:

I don't think Cassandra was "purposefully developed" for some target number of tables - there is no evidence of any such explicit intent. Instead, it would be fair to say that Cassandra was "not purposefully developed" with a goal of supporting large numbers of tables. Sometimes features and capabilities come for free, or as a side effect of the technologies used, but usually specific capabilities (such as large numbers of tables) require explicit intent and explicit effort.

One could indeed endeavor to design a data store (I'm not even sure it would still be considered a database per se) that supported either large numbers of tables or an additional level of storage model in between table and row (call it "group" maybe, or "sub-table"). But obviously Cassandra was not designed with that goal in mind.

Traditionally, a "table" is a defined relation over a set of data. Relation and data are distinct concepts. And a relation name is not simply a Java-style "object". A relation (table) name is supposed to represent an abstraction or entity type, while essentially all of the cases I have heard of for wanting thousands (or even hundreds) of tables are trying to use a table as a container for a group of rows belonging to a specific entity instance rather than a distinct entity type. Granted, Cassandra is not obligated to be limited to the relational model, but Cassandra, and especially CQL, is intentionally modeled reasonably closely on the relational model in terms of data modeling abstractions, even though the storage engine is designed to scale across nodes.

You could file a Jira requesting such a feature improvement. And then we would see if sentiment has shifted over the years.

The key thing is to offer up a use case that warrants support for large numbers of tables. So far, it has usually been the case that the perceived need for separate tables could easily be met using clustering columns of a single table.

Seriously, if you guys can define a legitimate use case that can't easily be handled by a single table, that could get the discussion started.

-- Jack Krupansky
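A minimal CQL sketch of the single-table alternative Eric and Jack describe, for the multi-tenancy case. All names here are illustrative, not from the thread; the point is that the value which would otherwise have driven per-tenant table names becomes part of the partition key:

    -- Instead of one table per tenant (events_acme, events_globex, ...):
    CREATE TABLE app.events (
        tenant_id  text,
        event_time timeuuid,
        payload    text,
        PRIMARY KEY ((tenant_id), event_time)
    );

    -- Each tenant's data lives in its own partitions:
    SELECT * FROM app.events WHERE tenant_id = 'acme';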
On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:

Hi Jack

Being purposefully developed to only handle up to “a few hundred” tables is reason enough. I accept that, and likely a use case with many tables was never really considered. But I would still like to understand the design choices that were made, so that perhaps we can gain some confidence in this upper limit on the number of tables. The best estimate we have so far is “a few hundred”, which is a bit vague.

Regarding scaling, I'm not talking about scaling in terms of data volume, but in terms of how the data is structured. One thousand tables with one row each hold the same data volume as one table with one thousand rows, excluding any data structures required to maintain the extra tables. But whereas the first seems likely to bring a Cassandra cluster to its knees, the second will run happily on a single-node cluster on a low-end machine.

We will design our code to use a single table to avoid having nightmares with this issue. But if there is any authoritative documentation on this characteristic of Cassandra, I would love to know more.

FJ
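A sketch of the equivalence Fernando describes, with hypothetical names: the discriminator that would otherwise be a table name becomes the partition key, so a thousand one-row tables collapse into a thousand rows of one table:

    -- Rather than: CREATE TABLE sandbox.instance_0001 ... (x 1,000)
    CREATE TABLE sandbox.instances (
        instance_id text,   -- what would have been the table name
        value       text,
        PRIMARY KEY (instance_id)
    );

    -- Reads remain a single-partition lookup:
    SELECT value FROM sandbox.instances WHERE instance_id = 'instance_0042';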
On 01 Mar 2016, at 14:23, Jack Krupansky <jack.krupan...@gmail.com> wrote:

I don't think there are any "reasons behind it." It is simply empirical experience - as reported here.

Cassandra scales in two dimensions - number of rows per node and number of nodes. If some source of information led you to believe otherwise, please point out the source so that we can endeavor to correct it.

The exact number of rows per node and tables per node will always have to be evaluated empirically - a proof-of-concept implementation - since it all depends on the capabilities of your hardware combined with your specific data model, your specific data values, your specific access patterns, and your specific load. It also depends on your own tolerance for degradation of latency and throughput - some people might find a given set of performance metrics acceptable while others might not.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:

Hi Tommaso

It's not that I _need_ a large number of tables. This approach maps easily to the problem we are trying to solve, but it's becoming clear it's not the right approach.

At the moment I'm trying to understand the limitations in Cassandra regarding the number of tables and the reasons behind them. I've come to the mailing list because my Google-fu is not giving me what I'm looking for :(

FJ

On 01 Mar 2016, at 09:36, tommaso barbugli <tbarbu...@gmail.com> wrote:

Hi Fernando,

I used to run a cluster with ~300 tables (1 keyspace) on C* 2.0, and it was a real pain in terms of operations. Repairs were terribly slow, booting C* slowed down, and in general tracking table metrics became a bit more work. Why do you need this high number of tables?

Tommaso

On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:

Hi Jack

By entry I mean row.

Apologies for the “obsolete terminology”. When I first looked at Cassandra it was still on CQL2, and now that I'm looking at it again I've defaulted to the terms I already knew. I will bear it in mind and call them tables from now on.

Is there any documentation about this limit? For example, I'd be keen to know how much memory is consumed per table, and I'm also curious about the reasons for keeping this in memory. I'm trying to understand the limitations here, rather than challenge them.

So far I have found nothing in my search, which is why I had to resort to some “load testing” to see what happens when you push the table count high.

Thanks
FJ

On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupan...@gmail.com> wrote:

3,000 entries? What's an "entry" - do you mean row, column, or... what?

You are using the obsolete terminology of CQL2 and Thrift - column family. With CQL3 you should be creating "tables". The practical recommendation of an upper limit of a few hundred tables across all keyspaces remains.

Technically you can go higher, and technically you can reduce the overhead per table (an undocumented Jira - intentionally undocumented, since it is strongly not recommended), but it is unlikely that you will be happy with the results.

What is the nature of the use case?

You basically have two choices: an additional clustering column to distinguish categories of table, or a separate cluster for each few hundred tables.

-- Jack Krupansky

On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <fernando.jime...@wealth-port.com> wrote:

Hi all

I have a use case for Cassandra that would require creating a large number of column families. I have found references to early versions of Cassandra where each column family required a fixed amount of memory on all nodes, effectively imposing an upper limit on the total number of CFs. I have also seen rumblings that this may have been fixed in later versions.

To put the question to rest, I have set up a DSE sandbox and written some code to generate column families populated with 3,000 entries each.

Unfortunately I have now hit this issue:
https://issues.apache.org/jira/browse/CASSANDRA-9291

So I will have to retest against Cassandra 3.0 instead.

However, I would like to understand the limitations regarding the creation of column families:

* Is there a practical upper limit?
* Is this a fixed limit, or does it scale as more nodes are added to the cluster?
* Is there a difference between one keyspace with thousands of column families, and thousands of keyspaces with only a few column families each?

I haven't found any hard evidence/documentation to help me here, but if you can point me in the right direction, I will oblige and RTFM away.

Many thanks for your help!

Cheers
FJ
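Returning to the first of Jack's two choices above - an additional clustering column to distinguish what would otherwise be separate tables - a hedged sketch with hypothetical names:

    CREATE TABLE sandbox.readings (
        entity_id text,
        category  text,      -- would otherwise have driven per-category table names
        ts        timestamp,
        value     double,
        PRIMARY KEY ((entity_id), category, ts)
    );

    -- One category's rows form a contiguous slice of the partition:
    SELECT * FROM sandbox.readings
     WHERE entity_id = 'e-17' AND category = 'temperature';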