Re: Getting the most-recent version from time-series data

Clint Kelly Fri, 28 Feb 2014 10:05:12 -0800

Hi Tupshin,

BTW, you asked earlier about the number of different distinct "family"
values.  There could easily be millions of different families, each with
many different values.  Right now I see two options:


   1. Query the table once just to get all of the distinct families, then
   do separate queries for each family to get the most-recent version.
   2. Read back all of the versions of all of the families and then filter
   on the client side.

Neither one of these are great solutions, although once we have users
reading back millions of values in a single query, they will have to
indicate (to our software that sits on top of C*) that they are going to
use paging, and then we are going to be doing multiple client / server
operations anyway.  I'd just like to minimize them.  :)

Best regards,
Clint




On Fri, Feb 28, 2014 at 9:47 AM, Clint Kelly <clint.ke...@gmail.com> wrote:

> Hi Tupshin,
>
> Thanks for your help once again, I really appreciate it.  Quick question
> regarding the issue of token-aware routing, etc.  Let's say that I am using
> the table described earlier:
>
>
> CREATE TABLE time_series_stuff (
>   key text,
>   family text,
>   version int,
>   val text,
>   PRIMARY KEY (key, family, version)
> ) WITH CLUSTERING ORDER BY (family ASC, version DESC)
>
> I want to retrieve values for the most-recent version of every family for
> a given key, doing something like:
>
>     SELECT * from time_series_stuff where key='mykey'
>
> but getting only one version per family.
>
> All of this data should live on the same node (or set of replica nodes),
> correct?  I am specifying the partition key here, and I thought that only
> the partition key determined on what physical nodes data exists.
> Therefore, I would think that all of the results from this query would come
> from a single replica node (or set of replica nodes, if the consistency
> level is greater than 1).
>
> Would you mind clarifying?  Thanks a lot!
>
> Best regards,
> Clint
>
>
>
>
>
>
> On Wed, Feb 26, 2014 at 4:56 AM, Tupshin Harper <tups...@tupshin.com>wrote:
>
>> And one last clarification. Where I said "stored procedure" earlier, I
>> meant "prepared statement". Sorry for the confusion. Too much typing while
>> tired.
>>
>> -Tupshin
>>
>>
>> On Tue, Feb 25, 2014 at 10:36 PM, Tupshin Harper <tups...@tupshin.com>wrote:
>>
>>> I failed to address the matter of not knowing the families in advance.
>>>
>>> I can't really recommend any solution to that other than storing the
>>> list of families in another structure that is readily queryable. I don't
>>> know how many families you are thinking, but if it is in the millions or
>>> more, You might consider constructing another table such as:
>>> CREATE TABLE families (
>>>   key int,
>>>   family text,
>>>   PRIMARY KEY (key, family)
>>> );
>>>
>>>
>>> store your families there, with a knowable set of keys (I suggest
>>> something like the last 3 digits of the md5 hash of the family). So then
>>> you could retrieve your families in nice sized batches
>>> SELECT family FROM id WHERE key=0;
>>> and then do the fan-out selects that I described previously.
>>>
>>> -Tupshin
>>>
>>>
>>> On Tue, Feb 25, 2014 at 10:15 PM, Tupshin Harper <tups...@tupshin.com>wrote:
>>>
>>>> Hi Clint,
>>>>
>>>> What you are describing could actually be accomplished with the Thrift
>>>> API and a multiget_slice with a slicerange having a count of 1. Initially I
>>>> was thinking that this was an important feature gap between Thrift and CQL,
>>>> and was going to suggest that it should be implemented (possible syntax is
>>>> in https://issues.apache.org/jira/browse/CASSANDRA-6167 which is
>>>> almost a superset of this feature).
>>>>
>>>> But then I was convinced by some colleagues, that with a modern CQL
>>>> driver that is token aware, you are actually better off (in terms of
>>>> latency, throughput, and reliability), by doing each query separately on
>>>> the client.
>>>>
>>>> The reasoning is that if you did this with a single query, it would
>>>> necessarily be sent to a coordinator that wouldn't own most of the data
>>>> that you are looking for. That coordinator would then need to fan out the
>>>> read to all the nodes owning the partitions you are looking for.
>>>>
>>>> Far better to just do it directly on the client. The token aware client
>>>> will send each request for a row straight to a node that owns it. With a
>>>> separate connection open to each node, this is done in parallel from the
>>>> get-go. Fewer hops. Less load on the coordinator. No bottlenecks. And with
>>>> a stored procedure, very very little additional overhead to the client,
>>>> server, or network.
>>>>
>>>> -Tupshin
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 7:48 PM, Clint Kelly <clint.ke...@gmail.com>wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Let's say that I have a table that looks like the following:
>>>>>
>>>>> CREATE TABLE time_series_stuff (
>>>>>   key text,
>>>>>   family text,
>>>>>   version int,
>>>>>   val text,
>>>>>   PRIMARY KEY (key, family, version)
>>>>> ) WITH CLUSTERING ORDER BY (family ASC, version DESC) AND
>>>>>   bloom_filter_fp_chance=0.010000 AND
>>>>>   caching='KEYS_ONLY' AND
>>>>>   comment='' AND
>>>>>   dclocal_read_repair_chance=0.000000 AND
>>>>>   gc_grace_seconds=864000 AND
>>>>>   index_interval=128 AND
>>>>>   read_repair_chance=0.100000 AND
>>>>>   replicate_on_write='true' AND
>>>>>   populate_io_cache_on_flush='false' AND
>>>>>   default_time_to_live=0 AND
>>>>>   speculative_retry='99.0PERCENTILE' AND
>>>>>   memtable_flush_period_in_ms=0 AND
>>>>>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>>>>>   compression={'sstable_compression': 'LZ4Compressor'};
>>>>>
>>>>> cqlsh:fiddle> select * from time_series_stuff ;
>>>>>
>>>>>  key    | family  | version | val
>>>>> --------+---------+---------+--------
>>>>>  monday | revenue |       3 | $$$$$$
>>>>>  monday | revenue |       2 |    $$$
>>>>>  monday | revenue |       1 |     $$
>>>>>  monday | revenue |       0 |      $
>>>>>  monday | traffic |       2 | medium
>>>>>  monday | traffic |       1 |  light
>>>>>  monday | traffic |       0 |  heavy
>>>>>
>>>>> (7 rows)
>>>>>
>>>>> Now let's say that I'd like to perform a query that gets me the most
>>>>> recent N versions of "revenue" and "traffic."
>>>>>
>>>>> Is there a CQL query to do this?  Let's say that N=1.  Then I know
>>>>> that I can do:
>>>>>
>>>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
>>>>> family='revenue' limit 1;
>>>>>
>>>>>  key    | family  | version | val
>>>>> --------+---------+---------+--------
>>>>>  monday | revenue |       3 | $$$$$$
>>>>>
>>>>> (1 rows)
>>>>>
>>>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
>>>>> family='traffic' limit 1;
>>>>>
>>>>>  key    | family  | version | val
>>>>> --------+---------+---------+--------
>>>>>  monday | traffic |       2 | medium
>>>>>
>>>>> (1 rows)
>>>>>
>>>>> But what if I have lots of "families" and I want to get the most
>>>>> recent N versions of all of them in a single CQL statement.  Is that
>>>>> possible?  Unfortunately I am working on something where the family names
>>>>> and the number of most-recent versions are not known a priori (I am 
>>>>> porting
>>>>> some code that was designed for HBase).
>>>>>
>>>>> Best regards,
>>>>> Clint
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Getting the most-recent version from time-series data

Reply via email to