@jack thanks for taking time to respond. I agree I could totally redesign
and rewrite it to fit the newer CQL3 model, but are you really
recommending I throw away 4 years of work and completely rewrite code
that works and has been tested?

Ignoring the practical aspects for now, let's explore the topic a bit
further. Since not everyone has spent 5+ years designing and building
temporal databases, it's probably worth going over some fundamental
theory, at the risk of boring the hell out of everyone.

1. a temporal record ideally should have one unique key for its entire
life. I've done other approaches in the past with composite keys in an
RDBMS, and it sucks. Could it work? Yes, but I already know from
first-hand experience how much pain that causes when temporal records
need to be moved around, or when you need to audit the data for lawsuits.
2. the life of a temporal record may span n versions, and no two versions
are guaranteed to be identical in structure, let alone content
3. the temporal metadata about each entity, like version, previous
version, branch, previous branch, create date, last transaction and
version counter, is required for each record. Those are the only required
static columns, and they are managed by the framework. Users aren't
supposed to manually screw with those metadata columns, but obviously
they could by going into cqlsh.
4. at any time, import and export of a single temporal record with all
versions and metadata may occur, so optimal storage and design is
critical. For example, someone might want to copy or move 100,000 records
from one cluster to another cluster and retain the full history.
5. the user can query for all or any number of versions of a temporal
record, by version number or branch. For example, it's common to fetch
version 10 and compare it against version 12. It's just like doing a diff
in SVN, git or CVS, but for business data. Unlike text files, a diff on
business records is a bit more complex, since each record is an object
graph.
6. versioning and branching of temporal records relies on business logic,
which can be the stock algorithm or one defined by the user
7. Saving and retrieving data has to be predictable and quick. This means
ideally all versions are in the same row and on the same node. Before CQL
and composite keys, storing data in different rows meant it could end up
on different nodes. Thankfully, with composite keys, Cassandra will use
the first column as the partition key.
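To illustrate points 1, 3 and 7, the general shape in CQL3 terms would be
something like the sketch below. The table and column names are invented
for illustration only; the real metadata set is richer and managed
entirely by the framework.

```sql
-- Hypothetical sketch: one partition per temporal record, one clustering
-- row per version, so the entire history lives on one node and can be
-- read, exported or diffed with a single partition read.
CREATE TABLE temporal_record (
    record_key       uuid,              -- point 1: one key for the record's entire life
    version          int,               -- clustering column: versions sort inside the partition
    prev_version     int,               -- point 3: framework-managed metadata per version
    branch           text,
    prev_branch      text,
    create_date      timestamp static,  -- static: shared by every version of the record
    last_transaction timestamp static,
    version_counter  int static,
    payload          blob,              -- serialized object graph for this version
    PRIMARY KEY (record_key, version)   -- point 7: record_key is the partition key
);

-- Entire history in one read, or a single version:
-- SELECT * FROM temporal_record WHERE record_key = ?;
-- SELECT payload FROM temporal_record WHERE record_key = ? AND version = 12;
```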

In terms of adding dynamic_column_name as part of the composite key, that
isn't ideal for my use case, for several reasons.

1. a record might not have any dynamic columns at all; the user decides
this. The only thing the framework requires is a unique key that doesn't
collide. If the user chooses their own key instead of a UUID, the system
checks for a collision before saving a new record.

2. we use dynamic columns to provide projections, aka views of a temporal
entity. This means we can extract fields nested deep in the graph and
store them as dynamic columns to avoid reading the entire object. Unlike
other dynamic-column use cases, both the column name and the value type
will vary. I know it's popular to use dynamic columns to store
time-series data like user click streams, but there every column has the
same type.
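As a purely hypothetical illustration of such a projection, using a
generic name/value layout (table and names invented here, not our actual
schema): a field buried deep in the serialized graph gets copied out
under its own column name, so a read can grab just that value without
deserializing the whole object.

```sql
-- Sketch only. The projection's column name and value type vary per
-- record, unlike time-series columns, which all share one type.
INSERT INTO temporal_entity (key, column_name, column_value)
VALUES (0x01, textAsBlob('customer.address.city'), textAsBlob('Austin'));

-- Read just the projection instead of the whole object graph:
SELECT column_value FROM temporal_entity
WHERE key = 0x01 AND column_name = textAsBlob('customer.address.city');
```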

3. we allow the user to index secondary columns, but on read we always
use the value in the object. We also integrated Solr to give us more
advanced indexing features.

4. we provide an object API to make temporal queries easy. It's modeled
on and inspired by JPA/Hibernate. We could have invented another text
query language, or tried to use TSQL2, but an object API feels more
intuitive to me.
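To make the earlier diff point (point 5 in the first list) concrete:
because each version is an object graph, a diff has to walk the graph
rather than compare lines. Here's a toy sketch in Python, using nested
dicts to stand in for typed object graphs; it's illustrative only, not
our actual implementation.

```python
def diff_graph(old, new, path=""):
    """Recursively diff two object graphs (nested dicts here) and return
    a list of (path, old_value, new_value) changes."""
    changes = []
    for key in sorted(set(old) | set(new)):
        p = f"{path}.{key}" if path else key
        if key not in old:
            changes.append((p, None, new[key]))          # field added
        elif key not in new:
            changes.append((p, old[key], None))          # field removed
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            changes.extend(diff_graph(old[key], new[key], p))  # recurse
        elif old[key] != new[key]:
            changes.append((p, old[key], new[key]))      # field changed
    return changes

# Compare two versions of a record, like diffing version 10 against 12:
v10 = {"name": "Acme", "address": {"city": "Austin", "zip": "78701"}}
v12 = {"name": "Acme Corp", "address": {"city": "Austin", "zip": "78702"}}
print(diff_graph(v10, v12))
# [('address.zip', '78701', '78702'), ('name', 'Acme', 'Acme Corp')]
```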

Could I fit a square peg into a round hole? Yes, but does that make any
sense? If I were building a whole new temporal database from scratch, I
might do things differently, but CQL3 didn't exist back in 2008/2009, so
I couldn't have used it. Aside from all of that, an object API is more
natural for temporal databases. The model really is an object graph, not
separate database tables stitched together. Any change to any part of the
record requires versioning it and handling it correctly. Having built
temporal databases on RDBMS, using SQL meant building a general-purpose
object API to make things easier; we needed to be database agnostic, so
we couldn't use the object APIs that are available in some databases.
Hopefully that helps provide context and details. I don't expect people
to gain a deep understanding of temporal databases from my ramblings,
given it took me over 8 years to learn all of this stuff.


On Thu, Jan 22, 2015 at 12:51 AM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> Peter,
>
> At least from your description, the proposed use of the clustering column
> name seems at first blush to fully fit the bill. The point is not that the
> resulting clustered primary key is used to reference an object, but that a
> SELECT on the partition key references the entire object, which will be a
> sequence of CQL3 rows in a partition, and then the clustering column key is
> added when you wish to access that specific aspect of the object. What's
> missing? Again, just store the partition key to reference the full object -
> no pollution required!
>
> And please note that any number of clustering columns can be specified, so
> more structured "dynamic columns" can be supported. For example, you could
> have a timestamp as a separate clustering column to maintain temporal state
> of the database. The partition key can also be structured from multiple
> columns as a composite partition key as well.
>
> As far as all these static columns, consider them optional and merely an
> optimization. If you wish to have a 100% opaque object model, you wouldn't
> have any static columns and the only non-primary key column would be the
> blob value field. Every object attribute would be specified using another
> clustering column name and blob value. Presto, everything you need for a
> pure, opaque, fully-generalized object management system - all with just
> CQL3. Maybe we should include such an example in the doc and with the
> project to more strongly emphasize this capability to fully model
> arbitrarily complex object structures - including temporal structures.
>
> Anything else missing?
>
> As a general proposition, you can use the term "clustering column" in CQL3
> wherever you might have used "dynamic column" in Thrift. The point in CQL3
> is not to eliminate a useful feature, dynamic column, but to repackage the
> feature to make a lot more sense for the vast majority of use cases. Maybe
> there are some cases that don't exactly fit as well as desired, but feel
> free to specifically identify such cases so that we can elaborate how we
> think they are covered or at least covered well enough for most users.
>
>
> -- Jack Krupansky
>
> On Wed, Jan 21, 2015 at 12:19 PM, Peter Lin <wool...@gmail.com> wrote:
>
>>
>> the example you provided does not work for my use case.
>>
>>   CREATE TABLE t (
>>     key blob,
>>     static my_static_column_1 int,
>>     static my_static_column_2 float,
>>     static my_static_column_3 blob,
>>     ....,
>>     dynamic_column_name blob,
>>     dynamic_column_value blob,
>>     PRIMARY KEY (key, dynamic_column_name)
>>   );
>>
>> the dynamic column can't be part of the primary key. The temporal entity
>> key can be the default UUID or the user can choose the field in their
>> object. Within our framework, we have the concept of temporal links
>> between one or more temporal entities. Polluting the primary key with
>> the dynamic column wouldn't work.
>>
>> Please excuse the confusing RDB comparison. My point is that Cassandra's
>> dynamic column feature is the "unique" feature that makes it better than
>> traditional RDB or newSql like VoltDB for building temporal databases. With
>> databases that require static schema + alter table for managing schema
>> evolution, it makes it harder and results in down time.
>>
>> One of the challenges of data management over time is evolving the data
>> model and making queries simple. If the record is 5 years old, it probably
>> has a different schema than a record inserted this week. With temporal
>> databases, every update is an insert, so it's a little bit more complex
>> than just "use a blob". There's a whole level of complication with temporal
>> data, and how CQL3 custom types would handle it isn't clear to me. I've
>> read the CQL3 documentation on custom types several times and it is rather
>> poor. It gives me the impression there's still work needed to get custom
>> types into good shape.
>>
>> With regard to examples others have told me, your advice is fair. A few
>> minutes with google and some blogs should pop up. The reason I bring these
>> things up isn't to put down CQL. It's because I care and want to help
>> improve Cassandra by sharing my experience. I consistently recommend new
>> users learn and understand both Thrift and CQL.
>>
>>
>>
>> On Wed, Jan 21, 2015 at 11:45 AM, Sylvain Lebresne <sylv...@datastax.com>
>> wrote:
>>
>>> On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin <wool...@gmail.com> wrote:
>>>
>>>> I don't remember other people's examples in detail due to my shitty
>>>> memory, so I'd rather not misquote.
>>>>
>>>
>>> Fair enough, but maybe you shouldn't use "people's examples you don't
>>> remember" as an argument then. Those examples might be wrong or outdated and
>>> that kind of stuff creates confusion for everyone.
>>>
>>>
>>>>
>>>> In my case, I mix static and dynamic columns in a single column family
>>>> with primitives and objects. The objects are temporal object graphs with a
>>>> known type. Doing this type of stuff is basically transparent for me, since
>>>> I'm using thrift and our data modeler generates helper classes. Our tooling
>>>> seamlessly convert the bytes back to the target object. We have a few
>>>> standard static columns related to temporal metadata. At any time, dynamic
>>>> columns can be added and they can be primitives or objects.
>>>>
>>>
>>> I don't see anything in that that cannot be done with CQL. You can mix
>>> static and dynamic columns in CQL thanks to static columns. More precisely,
>>> you can do what you're describing with a table looking a bit like this:
>>>   CREATE TABLE t (
>>>     key blob,
>>>     static my_static_column_1 int,
>>>     static my_static_column_2 float,
>>>     static my_static_column_3 blob,
>>>     ....,
>>>     dynamic_column_name blob,
>>>     dynamic_column_value blob,
>>>     PRIMARY KEY (key, dynamic_column_name)
>>>   );
>>>
>>> And your helper classes will serialize your objects as they probably do
>>> today (if you use a custom comparator, you can do that too). And let it be
>>> clear that I'm not pretending that doing it this way is tremendously
>>> simpler than thrift. But I'm saying that 1) it's possible and 2) while it's
>>> not meaningfully simpler than thrift, it's not really harder either (and
>>> in fact, it's actually less verbose with CQL than with raw thrift).
>>>
>>>
>>>>
>>>> For the record, doing this kind of stuff in a relational database sucks
>>>> horribly.
>>>>
>>>
>>> I don't know what that has to do with CQL to be honest. If you're doing
>>> relational with CQL you're doing it wrong. And please note that I'm not
>>> saying CQL is the perfect API for modeling temporal data. But I don't get
>>> how thrift, which is very crude API, is a much better API at that than CQL
>>> (or, again, how it allows you to do things you can't with CQL).
>>>
>>> --
>>> Sylvain
>>>
>>
>>
>
