Thanks for the response, Peter. I used the temperature table because it's the most common example in CQL time-series tutorials and I thought I would reuse it. From some of the responses, it looks like I was wrong.

event_time is the time the event happened, so yes, it is the valid time. I was trying to see if I could get away with not having valid_from and valid_to in Cassandra. transaction_time is the time the database record was written. Let's take an example:

INSERT INTO temperatures (weatherstation_id, event_time, transaction_time, temperature)
VALUES ('1234ABCD', '2015-02-18 07:01:00', '2015-02-18 07:01:00', '72F');

INSERT INTO temperatures (weatherstation_id, event_time, transaction_time, temperature)
VALUES ('1234ABCD', '2015-02-18 08:01:00', '2015-02-18 08:01:00', '72F');

If I get an update for the first record tomorrow and want to keep both versions, I would also have:

INSERT INTO temperatures (weatherstation_id, event_time, transaction_time, temperature)
VALUES ('1234ABCD', '2015-02-18 07:01:00', '2015-02-19 07:01:00', '72F');

(note the transaction_time is now the 19th).

I fundamentally need to execute two types of queries:

1. Select the latest values for the weather station for a given event-time period, which should ideally return just the first and third records.
2. Select the values as of a particular transaction time (say '2015-02-18 08:01:00'), in which case I would expect to get the first and second records back.
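To make those concrete, this is roughly what I am after against the temperatures table from the bottom of this thread (treat these as intent rather than working CQL; the second one is exactly the problem this thread is about):

-- 1. all versions for an event-time window; narrowing to the latest version
--    per event_time would still have to happen on the client
SELECT * FROM temperatures
 WHERE weatherstation_id = '1234ABCD'
   AND event_time >= '2015-02-18 00:00:00'
   AND event_time <  '2015-02-19 00:00:00';

-- 2. state as of a transaction time; with event_time as the first clustering
--    column, CQL only lets me restrict transaction_time if event_time is
--    pinned with an equals, so this one fails
SELECT * FROM temperatures
 WHERE weatherstation_id = '1234ABCD'
   AND transaction_time <= '2015-02-18 08:01:00';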
About your comment on having valid_time in the keys, do I have a choice in Cassandra, unless you are suggesting secondary indexes? I am new to bi-temporal data modeling, so please advise if you think building this on top of Cassandra is a stupid idea.

-Rajesh

On Sun, Feb 15, 2015 at 10:03 AM, Peter Lin <wool...@gmail.com> wrote:

>
> I've built several different bi-temporal databases over the years for a
> variety of applications, so I have to ask "why are you modeling it this
> way?"
>
> Having a temperatures table doesn't make sense to me. Normally a
> bi-temporal database has transaction time and valid time. The transaction
> time is the timestamp of when the data is saved, and valid time is usually
> expressed as effective & expiration time.
>
> In the example, is event time supposed to be valid time, and what is the
> granularity (seconds, hours, or days)?
> Which kinds of queries do you need to run most of the time?
> Why are event_time and transaction_time part of the key?
>
> If you take time to study temporal databases, having valid time in the
> keys will cause a lot of headaches. Basically, it's an anti-pattern for
> temporal databases. The key should only be the unique identifier and it
> shouldn't have transaction or valid time. The use case you describe looks
> more like regular time series data and not really bi-temporal.
>
> I would suggest taking time to understand the kinds of queries you need to
> run and then change the table.
>
> peter
>
>
> On Sun, Feb 15, 2015 at 9:00 AM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> I had forgotten, but there is a new tuple notation to iterate over more
>> than one clustering column in C* 2.0.6:
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-4851
>>
>> For example,
>>
>> SELECT ... WHERE (c1, c2) > (1, 0)
>>
>> There's an example in the CQL spec:
>> https://cassandra.apache.org/doc/cql3/CQL.html
>>
>>
>> -- Jack Krupansky
>>
>> On Sat, Feb 14, 2015 at 6:29 PM, Dave Brosius <dbros...@mebigfatguy.com>
>> wrote:
>>
>>> As you point out, there's not really a node-based problem with your
>>> query from a performance point of view. This is a limitation of CQL in
>>> that CQL wants to slice one section of a partition's row (no matter how
>>> big the section is). In your case, you are asking to slice multiple
>>> sections of a partition's row, which currently isn't supported.
>>>
>>> It seems silly perhaps that this is the case, as certainly in your
>>> example it would seem useful, and not too difficult, but the problem is
>>> that you could wind up with n-depth slicing of that partitioned row given
>>> an arbitrary query, if range queries on clustering keys were allowed
>>> anywhere.
>>>
>>> At present, you can either duplicate the data using the other clustering
>>> key (transaction_time) as the primary clusterer for this use case, or omit
>>> the 3rd criterion (transaction_time = 'xxxx') in the query, get all
>>> the range query results, and filter on the client.
>>>
>>> hth,
>>> dave
>>>
>>>
>>> On 02/14/2015 06:05 PM, Raj N wrote:
>>>
>>> I don't think that solves my problem. The question really is why can't
>>> we use ranges for both time columns when they are part of the primary key.
>>> They are in one row after all. Is this just a CQL limitation?
>>>
>>> -Raj
>>>
>>> On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <doanduy...@gmail.com>
>>> wrote:
>>>
>>>> "I am trying to get the state as of a particular transaction_time"
>>>>
>>>> --> In that case you should probably define your primary key with the
>>>> clustering columns in another order:
>>>>
>>>> PRIMARY KEY (weatherstation_id, transaction_time, event_time)
>>>>
>>>> Then:
>>>>
>>>> select * from temperatures where weatherstation_id = 'foo' and
>>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>>> 00:00:00' and transaction_time = 'xxxx'
>>>>
>>>>
>>>> On Sat, Feb 14, 2015 at 3:06 AM, Raj N <raj.cassan...@gmail.com>
>>>> wrote:
>>>>
>>>>> Has anyone designed a bi-temporal table in Cassandra? It doesn't look
>>>>> like I can do this using CQL for now. Taking the time series example
>>>>> from well-known Cassandra modeling tutorials:
>>>>>
>>>>> CREATE TABLE temperatures (
>>>>>   weatherstation_id text,
>>>>>   event_time timestamp,
>>>>>   temperature text,
>>>>>   PRIMARY KEY (weatherstation_id, event_time)
>>>>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>>>>
>>>>> If I add another column, transaction_time:
>>>>>
>>>>> CREATE TABLE temperatures (
>>>>>   weatherstation_id text,
>>>>>   event_time timestamp,
>>>>>   transaction_time timestamp,
>>>>>   temperature text,
>>>>>   PRIMARY KEY (weatherstation_id, event_time, transaction_time)
>>>>> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>>>>>
>>>>> and then try to run a query using the following CQL, it throws an error:
>>>>>
>>>>> select * from temperatures where weatherstation_id = 'foo' and
>>>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>>>> 00:00:00' and transaction_time < '2015-01-02 00:00:00'
>>>>>
>>>>> It works if I use an equals clause for event_time. I am trying
>>>>> to get the state as of a particular transaction_time.
>>>>>
>>>>> -Raj
>>>>>
>>>>
>>>
>>
>
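PS: Based on Dave's and DuyHai's suggestions above, this is roughly the layout I am thinking of trying next: a second, duplicated table with transaction_time as the first clustering column, kept in sync with the original one. An untested sketch, and the table name is just a placeholder:

CREATE TABLE temperatures_by_txn (
  weatherstation_id text,
  transaction_time timestamp,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, transaction_time, event_time)
) WITH CLUSTERING ORDER BY (transaction_time DESC, event_time DESC);

-- query #2 ("as of" a transaction time): everything written on or before the
-- given time; picking the latest version per event_time still happens client-side
SELECT * FROM temperatures_by_txn
 WHERE weatherstation_id = '1234ABCD'
   AND transaction_time <= '2015-02-18 08:01:00';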