Re: Dynamic Columns in Cassandra 2.X

Peter Lin Fri, 13 Jun 2014 18:46:28 -0700

yes, thrift does have async, though I haven't had to use it yet.

Right now I'm working on adding CAS to hector followed by multi slice.



On Fri, Jun 13, 2014 at 9:01 PM, graham sanderson <gra...@vast.com> wrote:

> Note as I mentioned mid post, thrift also supports async nowadays (there
> was a recent discussion on cassandra dev and the choice was not to move to
> it)
>
> I think the binary protocol is the way forward; CQL3 needs some new
> features, or there need to be some other types of requests you can make
> over the binary protocol
>
> On Jun 13, 2014, at 5:51 PM, Peter Lin <wool...@gmail.com> wrote:
>
>
> Without a doubt there are nice features in CQL3, like notifications and
> async. I want to see CQL3 mature and handle all the use cases that Thrift
> handles easily today. It's to everyone's benefit to work together and
> improve CQL3.
>
> Another benefit of Thrift drivers today is being able to use an object API
> with generics. For tool builders, this is especially useful. Not everyone
> wants to write tools, but I do, so it matters to me.
>
>
>
>
> On Fri, Jun 13, 2014 at 6:39 PM, Laing, Michael <michael.la...@nytimes.com
> > wrote:
>
>> Just to add 2 more cents... :)
>>
>> The CQL3 protocol is asynchronous. This can provide a substantial
>> throughput increase, according to my benchmarking, when one uses
>> non-blocking techniques.
>>
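As a rough illustration of the throughput point above, here is a minimal sketch that contrasts waiting for each response in turn with issuing all requests before collecting responses. It does not use any Cassandra driver: `fake_query` and its fixed latency are hypothetical stand-ins for a network round trip.

```python
import asyncio

async def fake_query(key, latency=0.01):
    # Hypothetical stand-in for one request/response round trip (no real driver).
    await asyncio.sleep(latency)
    return f"row-{key}"

async def run_sequential(keys):
    # Blocking style: wait for each response before sending the next request.
    return [await fake_query(k) for k in keys]

async def run_pipelined(keys):
    # Non-blocking style: issue every request up front, then collect responses.
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(fake_query(k) for k in keys))

if __name__ == "__main__":
    keys = list(range(5))
    print(asyncio.run(run_sequential(keys)))
    print(asyncio.run(run_pipelined(keys)))
```

With simulated latency the sequential version takes roughly one latency per key, while the pipelined version takes roughly one latency in total; real gains depend on the server and the driver.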
>> It is also peer-to-peer. Hence the server can generate events to send to
>> the client, e.g. schema changes - in general, 'triggers' become possible.
>>
>> ml
>>
>>
>> On Fri, Jun 13, 2014 at 6:21 PM, graham sanderson <gra...@vast.com>
>> wrote:
>>
>>> My 2 cents…
>>>
>>> A motivation for CQL3 AFAIK was to make Cassandra more familiar to SQL
>>> users. This is a valid goal, and works well in many cases.
>>> Equally there are use cases (that some might find ugly) where Cassandra
>>> is chosen explicitly because of the sorts of things you can do at the
>>> thrift level, which aren’t (currently) exposed via CQL3
>>>
>>> To Robert’s point earlier - "Rational people should presume that Thrift
>>> support must eventually disappear"… he is probably right (though frankly
>>> I’d rather the non-blocking thrift version was added instead). However if
>>> we do get rid of the thrift interface, then it needs to be at a time that
>>> CQLn is capable of expressing all the things you could do via the thrift
>>> API. Note, I need to go look and see if the non-blocking thrift version
>>> also requires materializing the entire thrift object in memory.
>>>
>>> On Jun 13, 2014, at 4:55 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>
>>> There are always pros and cons with a query language.
>>>
>>> As far as I can see, the advantages of Thrift over CQL3 are:
>>>
>>>  1) Thrift requires a little less decoding server-side (a difference of
>>> around 10% in CPU usage).
>>>
>>>  2) Thrift uses more "compact" storage because CQL3 needs to add extra
>>> "marker" columns to guarantee the existence of the primary key. It gets
>>> worse when you use clustering columns, because each distinct clustering
>>> group has its own "marker" column.
>>>
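The storage overhead in point 2) can be estimated with simple arithmetic. The sketch below assumes the layout described above, one extra "marker" cell per clustering group on top of one cell per stored value; the function names are made up for illustration, and on-disk sizes will differ because of compression.

```python
def cql3_cells(rows, regular_columns):
    # Assumption: each CQL3 row (clustering group) stores one marker cell
    # plus one cell per non-null regular column.
    return rows * (regular_columns + 1)

def thrift_cells(rows, regular_columns):
    # Assumption: the Thrift-style layout stores one cell per value, no marker.
    return rows * regular_columns

def marker_overhead(rows, regular_columns):
    # Extra cells attributable to markers, as a fraction of the Thrift count.
    extra = cql3_cells(rows, regular_columns) - thrift_cells(rows, regular_columns)
    return extra / thrift_cells(rows, regular_columns)

if __name__ == "__main__":
    # 1000 clustering groups with a single value each: markers double the cells.
    print(cql3_cells(1000, 1), thrift_cells(1000, 1), marker_overhead(1000, 1))
```

The relative overhead shrinks as each clustering group carries more regular columns, which is why narrow rows are the worst case.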
>>>  That being said, point 1) is not really an issue since most of the time
>>> nodes are more I/O bound than CPU bound. Only in extreme cases, where you
>>> have a very high read rate on data that fits entirely in memory, might
>>> you notice the difference.
>>>
>>>  For point 2) this is a small trade-off for having access to a query
>>> language and being able to do slice queries using the WHERE clause. Some
>>> like it, others hate it; it's just a question of taste.  Please note that
>>> the "waste" in disk space is somewhat mitigated by compression.
>>>
>>>  Long story short, I think Thrift may be appropriate, but only in very
>>> few use cases. Recently a lot of improvements and features have been
>>> added to CQL3, so it should be considered the first choice for most
>>> users; if they fall into those few use cases, they can switch back to
>>> Thrift.
>>>
>>> My 2 cents
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Jun 13, 2014 at 11:43 PM, Peter Lin <wool...@gmail.com> wrote:
>>>
>>>>
>>>> With a text-based query approach like CQL, you lose the type with
>>>> dynamic columns. Yes, we're storing it as bytes, but it is simpler and
>>>> easier with Thrift to do these types of things.
>>>>
>>>> I like CQL3 and what it does, but text-based query languages make
>>>> certain dynamic schema use cases painful. Having used and built ORMs, I
>>>> find they are poorly suited to dynamic schemas. If you've never had to
>>>> write an ORM to handle dynamic user-defined schemas at runtime, it's
>>>> tough to see where the problems arise and how that makes life painful.
>>>>
>>>> Just to be clear, I'm not saying "don't use CQL3" or "CQL3 is bad". I'm
>>>> saying CQL3 is good for certain kinds of use cases and Thrift is good at
>>>> certain use cases. People need to look at what and how they're storing data
>>>> and do what makes the most sense to them. Slavishly following CQL3 doesn't
>>>> make any sense to me.
>>>>
>>>>
>>>>
>>>> On Fri, Jun 13, 2014 at 5:30 PM, DuyHai Doan <doanduy...@gmail.com>
>>>> wrote:
>>>>
>>>>> "the validation type is set to bytes, and my code is type safe, so it
>>>>> knows which serializers to use. Those dynamic columns are driven off the
>>>>> types in Java."  --> Correct. However, you are still bound by the column
>>>>> comparator type, which should be fixed (unless, again, you set it to
>>>>> bytes, in which case you lose the ordering and sorting feature).
>>>>>
>>>>>  Basically what you are doing is telling Cassandra to save data in the
>>>>> cells as raw bytes; the serialization is taken care of client-side using
>>>>> the appropriate serializer. This is a perfectly valid strategy.
>>>>>
>>>>>  But how is it different from using CQL3, setting the value type to
>>>>> "blob" (equivalent to bytes) and taking care of the serialization
>>>>> client-side as well? You can even imagine saving the value in JSON
>>>>> format and setting the type to "text".
>>>>>
>>>>>  Really, I don't see why CQL3 cannot achieve the scenario you describe.
>>>>>
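To make the "store raw bytes, serialize client-side" strategy concrete, here is a minimal sketch of a self-describing encoding that would work equally for a Thrift bytes value or a CQL3 blob. The one-byte type tags and the `encode`/`decode` helpers are assumptions for illustration only; they are not part of Cassandra, Hector, or any driver, and a client that already knows each column's type (as described above) would not need the tag.

```python
import struct

# Hypothetical one-byte type tags; not part of any Cassandra or Hector API.
TAG_INT, TAG_FLOAT, TAG_TEXT = 1, 2, 3

def encode(value):
    # Prefix the payload with a tag so the reader can pick the right decoder.
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, int):
        return struct.pack(">Bq", TAG_INT, value)    # tag + 8-byte signed int
    if isinstance(value, float):
        return struct.pack(">Bd", TAG_FLOAT, value)  # tag + 8-byte double
    if isinstance(value, str):
        return bytes([TAG_TEXT]) + value.encode("utf-8")
    raise TypeError(f"unsupported type: {type(value).__name__}")

def decode(blob):
    # Inspect the tag byte, then decode the remaining payload accordingly.
    tag, payload = blob[0], blob[1:]
    if tag == TAG_INT:
        return struct.unpack(">q", payload)[0]
    if tag == TAG_FLOAT:
        return struct.unpack(">d", payload)[0]
    if tag == TAG_TEXT:
        return payload.decode("utf-8")
    raise ValueError(f"unknown type tag: {tag}")
```

Note that a tagged value no longer sorts by its natural order as raw bytes, which matches the caveat above about losing ordering when everything is treated as bytes.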
>>>>>  For the record, when you create a table in CQL3 as follows:
>>>>>
>>>>>  CREATE TABLE user (
>>>>>      id bigint PRIMARY KEY,
>>>>>      firstname text,
>>>>>      lastname text,
>>>>>      last_connection timestamp,
>>>>>      ....);
>>>>>
>>>>>  C* will create a column family with validation type = bytes to
>>>>> accommodate the timestamp and text types for the firstname, lastname and
>>>>> last_connection columns. Basically the CQL3 engine is doing the
>>>>> serialization server-side for you
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 13, 2014 at 11:19 PM, Peter Lin <wool...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> the validation type is set to bytes, and my code is type safe, so it
>>>>>> knows which serializers to use. Those dynamic columns are driven off the
>>>>>> types in Java.
>>>>>>
>>>>>> Having said that, CQL3 does have a new custom type feature, but the
>>>>>> documentation on how that actually works is basically non-existent. One
>>>>>> could also modify CQL so that insert statements give Cassandra hints
>>>>>> about what type it is, but I'm not aware of anyone enhancing CQL3 to do
>>>>>> that.
>>>>>>
>>>>>> I realize my kind of use case is a bit unique, but I do know of
>>>>>> others that are doing similar kinds of things.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 13, 2014 at 5:11 PM, DuyHai Doan <doanduy...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> In thrift, when creating a column family, you need to define
>>>>>>>
>>>>>>> 1) the row/partition key type
>>>>>>> 2) the column comparator type
>>>>>>> 3) the validation type for the actual value (cell in CQL3
>>>>>>> terminology)
>>>>>>>
>>>>>>> Unless you use the "dynamic composites" feature, which does not exist
>>>>>>> (and probably won't) in CQL3, I don't see how you can have columns with
>>>>>>> "different types" on the same row/partition.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 13, 2014 at 11:06 PM, Peter Lin <wool...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> When I say dynamic column, I mean non-static columns of different
>>>>>>>> types within the same row. Some could be an object or one of the
>>>>>>>> defined datatypes.
>>>>>>>>
>>>>>>>> with thrift I use the appropriate serializer to handle these
>>>>>>>> dynamic columns.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 13, 2014 at 4:55 PM, DuyHai Doan <doanduy...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Well, before discussing "dynamic columns", we should first define
>>>>>>>>> the term clearly. What do people mean by "dynamic columns" exactly?
>>>>>>>>> Is it the ability to add many columns "of the same type" to an
>>>>>>>>> existing physical row?  If yes, then CQL3 does support it with
>>>>>>>>> clustering columns.
>>>>>>>>>
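One way to picture "many columns of the same type in one physical row" is as a sorted map from clustering key to value inside a partition. The toy model below is only a sketch under that assumption; the `Partition` class is hypothetical and says nothing about how Cassandra implements storage.

```python
import bisect

class Partition:
    """Toy model of one partition whose rows are kept sorted by clustering key."""

    def __init__(self):
        self._keys = []    # clustering keys, kept in sorted order
        self._values = {}  # clustering key -> value

    def insert(self, clustering_key, value):
        # A new key is placed in sorted position; an existing key is overwritten.
        if clustering_key not in self._values:
            bisect.insort(self._keys, clustering_key)
        self._values[clustering_key] = value

    def slice(self, start, end):
        # Rows with start <= clustering key < end, in clustering order.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]

if __name__ == "__main__":
    p = Partition()
    p.insert("zip", "10001")
    p.insert("city", "NYC")
    p.insert("color", "blue")
    print(p.slice("c", "d"))
```

Adding a "column" is just another insert with a new clustering key, with no schema change, and a range restriction on the clustering key corresponds to the slice.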
>>>>>>>>>
>>>>>>>>> On Fri, Jun 13, 2014 at 10:36 PM, Mark Greene <green...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Yeah I don't anticipate more than 1000 properties, well under in
>>>>>>>>>> fact. I guess the trade off of using the clustered columns is that
>>>>>>>>>> I'd have a table that would be tall and skinny which also has its
>>>>>>>>>> challenges w/r/t memory.
>>>>>>>>>>
>>>>>>>>>> I'll look into your suggestion a bit more and consider some
>>>>>>>>>> others around a hybrid of CQL and Thrift (where necessary). But from
>>>>>>>>>> a newb's perspective, I sense the community is unsettled around this
>>>>>>>>>> concept of truly dynamic columns. Coming from an HBase background,
>>>>>>>>>> it's a consideration I didn't anticipate having to evaluate.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> about.me <http://about.me/markgreene>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 13, 2014 at 4:19 PM, DuyHai Doan <
>>>>>>>>>> doanduy...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Mark
>>>>>>>>>>>
>>>>>>>>>>>  I believe that in your table you want to have some "common"
>>>>>>>>>>> fields that will be there whoever the customer is, and other fields
>>>>>>>>>>> that are entirely customer-dependent, don't you?
>>>>>>>>>>>
>>>>>>>>>>>  In this case, creating a table with static columns for the
>>>>>>>>>>> common fields and a clustering column representing all custom fields
>>>>>>>>>>> defined by a customer could be a solution (see here for static
>>>>>>>>>>> columns: https://issues.apache.org/jira/browse/CASSANDRA-6561 )
>>>>>>>>>>>
>>>>>>>>>>> CREATE TABLE user_data (
>>>>>>>>>>>    user_id bigint,
>>>>>>>>>>>    user_firstname text static,
>>>>>>>>>>>    user_lastname text static,
>>>>>>>>>>>    ...
>>>>>>>>>>>    custom_property_name text,
>>>>>>>>>>>    custom_property_value text,
>>>>>>>>>>>    PRIMARY KEY(user_id, custom_property_name, custom_property_value));
>>>>>>>>>>>
>>>>>>>>>>>  Please note that with this solution you need to have "at least
>>>>>>>>>>> one" custom property per customer to make it work
>>>>>>>>>>>
>>>>>>>>>>>  The only thing to take care of is the type of
>>>>>>>>>>> custom_property_value. You need to define it once and for all. To
>>>>>>>>>>> accommodate dynamic types, you can save the value either as blob or
>>>>>>>>>>> as text (JSON) and take care of the serialization/deserialization
>>>>>>>>>>> yourself on the client side.
>>>>>>>>>>>
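The text (JSON) option mentioned above can be sketched as follows. The property names and values are invented for illustration; `to_text` and `from_text` are hypothetical helpers around the standard `json` module, and writing the resulting strings to Cassandra is left out.

```python
import json

def to_text(value):
    # Serialize any JSON-compatible value for storage in a single text column.
    return json.dumps(value, sort_keys=True)

def from_text(text):
    # Recover the original value (and its JSON type) on the client side.
    return json.loads(text)

# Invented custom properties with different value types.
properties = {"age": 42, "vip": True, "tags": ["a", "b"], "score": 9.5}

# What would be written to custom_property_value, keyed by custom_property_name.
stored = {name: to_text(value) for name, value in properties.items()}

# What the client gets back after reading and decoding.
restored = {name: from_text(text) for name, text in stored.items()}
```

JSON keeps numbers, booleans, strings, lists and maps distinct, but values it has no native type for, such as timestamps or raw bytes, need an agreed convention between writers and readers.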
>>>>>>>>>>>  As an alternative you can save custom properties in a map,
>>>>>>>>>>> provided that their number is not too large. But considering the
>>>>>>>>>>> business case of a CRM, I believe it's quite rare for a user to have
>>>>>>>>>>> more than 1000 custom properties, isn't it?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 13, 2014 at 10:03 PM, Mark Greene <
>>>>>>>>>>> green...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> My use case requires the support of arbitrary columns much like
>>>>>>>>>>>> a CRM. My users can define 'custom' fields within the application.
>>>>>>>>>>>> Ideally I wouldn't have to change the schema at all, which is why I
>>>>>>>>>>>> like the old thrift approach rather than the CQL approach.
>>>>>>>>>>>>
>>>>>>>>>>>> Having said all that, I'd be willing to adapt my API to make
>>>>>>>>>>>> explicit schema changes to Cassandra whenever my user makes a
>>>>>>>>>>>> change to their custom fields if that's an accepted practice.
>>>>>>>>>>>>
>>>>>>>>>>>> Ultimately, I'm trying to figure out if the Cassandra community
>>>>>>>>>>>> intends to support true schemaless use cases in the future.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> about.me <http://about.me/markgreene>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 13, 2014 at 3:47 PM, DuyHai Doan <
>>>>>>>>>>>> doanduy...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> "This strikes me as bad practice in the world of multi tenant
>>>>>>>>>>>>> systems. I don't want to create a table per customer. So I'm
>>>>>>>>>>>>> wondering if dynamically modifying the table is an accepted
>>>>>>>>>>>>> practice?"  --> Can you give some details about your use case? How
>>>>>>>>>>>>> would you "alter" a table structure to adapt it to a new customer?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wouldn't it be better to model your table so that it supports
>>>>>>>>>>>>> addition/removal of customers?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jun 13, 2014 at 9:00 PM, Mark Greene <
>>>>>>>>>>>>> green...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks DuyHai,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a follow up question to #2. You mentioned ideally I
>>>>>>>>>>>>>> would create a new table instead of mutating an existing one.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This strikes me as bad practice in the world of multi tenant
>>>>>>>>>>>>>> systems. I don't want to create a table per customer. So I'm
>>>>>>>>>>>>>> wondering if dynamically modifying the table is an accepted
>>>>>>>>>>>>>> practice?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> about.me <http://about.me/markgreene>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jun 13, 2014 at 2:54 PM, DuyHai Doan <
>>>>>>>>>>>>>> doanduy...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello Mark
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  Dynamic columns, as you said, are perfectly supported by
>>>>>>>>>>>>>>> CQL3 via clustering columns. And no, using collections for
>>>>>>>>>>>>>>> storing dynamic data is a very bad idea if the cardinality is
>>>>>>>>>>>>>>> very high (>> 1000 elements).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1)  Is using Thrift a valid approach in the era of CQL?  -->
>>>>>>>>>>>>>>> Less and less. Unless you are looking for extreme performance,
>>>>>>>>>>>>>>> you'd be better off choosing CQL3. The ease of programming and
>>>>>>>>>>>>>>> querying with CQL3 is worth the small overhead in CPU.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) If CQL is the best practice,  should I alter the schema
>>>>>>>>>>>>>>> at runtime when I detect I need to do a schema mutation?  -->
>>>>>>>>>>>>>>> Ideally you should not alter the schema but create a new table to
>>>>>>>>>>>>>>> adapt to your changing requirements.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3) If I utilize CQL collections, will Cassandra page the
>>>>>>>>>>>>>>> entire thing into the heap?  --> Of course. All collections and
>>>>>>>>>>>>>>> maps in Cassandra are eagerly loaded entirely in memory on the
>>>>>>>>>>>>>>> server side. That's why it is recommended to limit their
>>>>>>>>>>>>>>> cardinality to ~ 1000 elements.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jun 13, 2014 at 8:33 PM, Mark Greene <
>>>>>>>>>>>>>>> green...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm looking for some best practices w/r/t supporting
>>>>>>>>>>>>>>>> arbitrary columns. It seems from the docs I've read around CQL
>>>>>>>>>>>>>>>> that they are supported in some capacity via collections but you
>>>>>>>>>>>>>>>> can't exceed 64K in size. For my requirements that would cause
>>>>>>>>>>>>>>>> problems.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So my questions are:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1)  Is using Thrift a valid approach in the era of CQL?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2) If CQL is the best practice,  should I alter the schema
>>>>>>>>>>>>>>>> at runtime when I detect I need to do a schema mutation?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  3) If I utilize CQL collections, will Cassandra page the
>>>>>>>>>>>>>>>> entire thing into the heap?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My data model is akin to a CRM, arbitrary column
>>>>>>>>>>>>>>>> definitions per customer.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
