Re: Practical limit on number of column families

Jack Krupansky Tue, 01 Mar 2016 08:09:40 -0800

I don't think Cassandra was "purposefully developed" for some target number
of tables - there is no evidence of any such explicit intent. Instead,
it would be fair to say that Cassandra was "not purposefully developed"
with a goal of supporting "large numbers of tables." Sometimes features and
capabilities come for free or as a side effect of the technologies used,
but usually specific features and specific capabilities (such as large
numbers of tables) require explicit intent and explicit effort.


One could indeed endeavor to design a data store (I'm not even sure it
would still be considered a database per se) that supported either large
numbers of tables or an additional level of storage model in between table
and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
not designed with that goal in mind.

Traditionally, a "table" is a defined relation over a set of data. Relation
and data are distinct concepts. And a relation name is not simply a
Java-style "object". A relation (table) name is supposed to represent an
abstraction or entity type, while essentially all of the cases I have heard
of for wanting thousands (or even hundreds) of tables are trying to use a
table as more of a container for a group of rows for a specific entity
instance rather than a distinct entity type. Granted, Cassandra is not
obligated to be limited to the relational model, but Cassandra, especially
CQL, is intentionally modeled reasonably closely on the relational model
in terms of the data modeling abstractions even though the storage engine
is designed to scale across nodes.

You could file a Jira requesting such a feature improvement. And then we
would see if sentiment has shifted over the years.

The key thing is to offer up a use case that warrants support for large
numbers of tables. So far, it has usually been the case that the perceived
need for separate tables could easily be met using clustering columns of a
single table.
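
As a minimal sketch of that idea (not Cassandra code, just a toy in-memory
model in Python), the would-be table name becomes one more component of the
primary key of a single table. The class name SingleTable, its method names,
and the key layout (partition key, group, row id) are all hypothetical,
chosen only for illustration:

```python
from collections import defaultdict


class SingleTable:
    """Toy stand-in for one table with primary key
    (partition_key, group, row_id). 'group' plays the role of the
    clustering column that replaces a separate table per group."""

    def __init__(self):
        # partition_key -> {(group, row_id): value}
        self._partitions = defaultdict(dict)

    def insert(self, partition_key, group, row_id, value):
        # Writing the same full key again overwrites the old value.
        self._partitions[partition_key][(group, row_id)] = value

    def select_group(self, partition_key, group):
        # Rows come back ordered by (group, row_id); keep one group only.
        rows = self._partitions.get(partition_key, {})
        return [
            (row_id, value)
            for (g, row_id), value in sorted(rows.items())
            if g == group
        ]


table = SingleTable()
table.insert("tenant-1", "orders", 2, "second order")
table.insert("tenant-1", "orders", 1, "first order")
table.insert("tenant-1", "invoices", 1, "first invoice")
orders = table.select_group("tenant-1", "orders")
```

Here select_group stands in for a query restricted to one group, so each
"table" of the many-tables design is just a slice of the one table.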

Seriously, if you guys can define a legitimate use case that can't easily
be handled by a single table, that could get the discussion started.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
fernando.jime...@wealth-port.com> wrote:

> Hi Jack
>
> Being purposefully developed to only handle up to “a few hundred” tables
> is reason enough. I accept that, and likely a use case with many tables was
> never really considered. But I would still like to understand the design
> choices made so that we can gain some confidence in this upper limit on
> the number of tables. The best estimate we have so far is “a few
> hundred”, which is a bit vague.
>
> Regarding scaling, I’m not talking about scaling in terms of data volume,
> but in terms of how the data is structured. One thousand tables with one
> row each is the same data volume as one table with one thousand rows,
> excluding any data structures required to maintain the extra tables. But
> whereas the first seems likely to bring a Cassandra cluster to its knees,
> the second will run happily on a single-node cluster on a low-end machine.
>
> We will design our code to use a single table to avoid having nightmares
> with this issue. But if there is any authoritative documentation on this
> characteristic of Cassandra, I would love to know more.
>
> FJ
>
>
> On 01 Mar 2016, at 14:23, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
> I don't think there are any "reasons behind it." It is simply empirical
> experience - as reported here.
>
> Cassandra scales in two dimensions - number of rows per node and number of
> nodes. If some source of information led you to believe otherwise, please
> point out the source so that we can endeavor to correct it.
>
> The exact number of rows per node and tables per node will always have to
> be evaluated empirically - a proof of concept implementation, since it all
> depends on the mix of capabilities of your hardware combined with your
> specific data model, your specific data values, your specific access
> patterns, and your specific load. And it also depends on your own personal
> tolerance for degradation of latency and throughput - some people might
> find a given set of performance metrics acceptable while others might not.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
>> Hi Tommaso
>>
>> It’s not that I _need_ a large number of tables. This approach maps
>> easily to the problem we are trying to solve, but it’s becoming clear it’s
>> not the right approach.
>>
>> At the moment I’m trying to understand the limitations in Cassandra
>> regarding number of Tables and the reasons behind it. I’ve come to the
>> email list as my Google-fu is not giving me what I’m looking for :(
>>
>> FJ
>>
>>
>>
>> On 01 Mar 2016, at 09:36, tommaso barbugli <tbarbu...@gmail.com> wrote:
>>
>> Hi Fernando,
>>
>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, and it
>> was a real pain in terms of operations. Repairs were terribly slow, C*
>> startup slowed down, and in general tracking table metrics became a bit
>> more work.
>> Why do you need this high number of tables?
>>
>> Tommaso
>>
>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
>> fernando.jime...@wealth-port.com> wrote:
>>
>>> Hi Jack
>>>
>>> By entry I mean row
>>>
>>> Apologies for the “obsolete terminology”. When I first looked at
>>> Cassandra it was still on CQL2, and now that I’m looking at it again I’ve
>>> defaulted to the terms I already knew. I will bear it in mind and call them
>>> tables from now on.
>>>
>>> Is there any documentation about this limit? For example, I’d be keen to
>>> know how much memory is consumed per table, and I’m also curious about the
>>> reasons for keeping this in memory. I’m trying to understand the
>>> limitations here, rather than challenge them.
>>>
>>> So far I have found nothing in my search, which is why I had to resort to
>>> some “load testing” to see what happens when you push the table count high.
>>>
>>> Thanks
>>> FJ
>>>
>>>
>>> On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupan...@gmail.com>
>>> wrote:
>>>
>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>>
>>> You are using the obsolete terminology of CQL2 and Thrift - column
>>> family. With CQL3 you should be creating "tables". The practical
>>> recommendation of an upper limit of a few hundred tables across all
>>> keyspaces remains.
>>>
>>> Technically you can go higher and technically you can reduce the
>>> overhead per table (an undocumented Jira - intentionally undocumented since
>>> it is strongly not recommended), but... it is unlikely that you will be
>>> happy with the results.
>>>
>>> What is the nature of the use case?
>>>
>>> You basically have two choices: an additional clustering column to
>>> distinguish categories of table, or a separate cluster for each few
>>> hundred tables.
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
>>> fernando.jime...@wealth-port.com> wrote:
>>>
>>>> Hi all
>>>>
>>>> I have a use case for Cassandra that would require creating a large
>>>> number of column families. I have found references to early versions of
>>>> Cassandra where each column family would require a fixed amount of memory
>>>> on all nodes, effectively imposing an upper limit on the total number of
>>>> CFs. I have also seen rumblings that this may have been fixed in later
>>>> versions.
>>>>
>>>> To put the question to rest, I have set up a DSE sandbox and created
>>>> some code to generate column families populated with 3,000 entries each.
>>>>
>>>> Unfortunately I have now hit this issue:
>>>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>>
>>>> So I will have to retest against Cassandra 3.0 instead
>>>>
>>>> However, I would like to understand the limitations regarding creation
>>>> of column families.
>>>>
>>>> * Is there a practical upper limit?
>>>> * Is this a fixed limit, or does it scale as more nodes are added into
>>>> the cluster?
>>>> * Is there a difference between one keyspace with thousands of column
>>>> families, vs thousands of keyspaces with only a few column families each?
>>>>
>>>> I haven’t found any hard evidence/documentation to help me here, but if
>>>> you can point me in the right direction, I will oblige and RTFM away.
>>>>
>>>> Many thanks for your help!
>>>>
>>>> Cheers
>>>> FJ
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
