You are correct that with that schema, all data for a give key would be in a single partition, and hence on the same node(s). I missed that before.
-Tupshin On Fri, Feb 28, 2014 at 12:47 PM, Clint Kelly <clint.ke...@gmail.com> wrote: > Hi Tupshin, > > Thanks for your help once again, I really appreciate it. Quick question > regarding the issue of token-aware routing, etc. Let's say that I am using > the table described earlier: > > > CREATE TABLE time_series_stuff ( > key text, > family text, > version int, > val text, > PRIMARY KEY (key, family, version) > ) WITH CLUSTERING ORDER BY (family ASC, version DESC) > > I want to retrieve values for the most-recent version of every family for > a given key, doing something like: > > SELECT * from time_series_stuff where key='mykey' > > but getting only one version per family. > > All of this data should live on the same node (or set of replica nodes), > correct? I am specifying the partition key here, and I thought that only > the partition key determined on what physical nodes data exists. > Therefore, I would think that all of the results from this query would come > from a single replica node (or set of replica nodes, if the consistency > level is greater than 1). > > Would you mind clarifying? Thanks a lot! > > Best regards, > Clint > > > > > > > On Wed, Feb 26, 2014 at 4:56 AM, Tupshin Harper <tups...@tupshin.com>wrote: > >> And one last clarification. Where I said "stored procedure" earlier, I >> meant "prepared statement". Sorry for the confusion. Too much typing while >> tired. >> >> -Tupshin >> >> >> On Tue, Feb 25, 2014 at 10:36 PM, Tupshin Harper <tups...@tupshin.com>wrote: >> >>> I failed to address the matter of not knowing the families in advance. >>> >>> I can't really recommend any solution to that other than storing the >>> list of families in another structure that is readily queryable. I don't >>> know how many families you are thinking, but if it is in the millions or >>> more, You might consider constructing another table such as: >>> CREATE TABLE families ( >>> key int, >>> family text, >>> PRIMARY KEY (key, family) >>> ); >>> >>> >>> store your families there, with a knowable set of keys (I suggest >>> something like the last 3 digits of the md5 hash of the family). So then >>> you could retrieve your families in nice sized batches >>> SELECT family FROM id WHERE key=0; >>> and then do the fan-out selects that I described previously. >>> >>> -Tupshin >>> >>> >>> On Tue, Feb 25, 2014 at 10:15 PM, Tupshin Harper <tups...@tupshin.com>wrote: >>> >>>> Hi Clint, >>>> >>>> What you are describing could actually be accomplished with the Thrift >>>> API and a multiget_slice with a slicerange having a count of 1. Initially I >>>> was thinking that this was an important feature gap between Thrift and CQL, >>>> and was going to suggest that it should be implemented (possible syntax is >>>> in https://issues.apache.org/jira/browse/CASSANDRA-6167 which is >>>> almost a superset of this feature). >>>> >>>> But then I was convinced by some colleagues, that with a modern CQL >>>> driver that is token aware, you are actually better off (in terms of >>>> latency, throughput, and reliability), by doing each query separately on >>>> the client. >>>> >>>> The reasoning is that if you did this with a single query, it would >>>> necessarily be sent to a coordinator that wouldn't own most of the data >>>> that you are looking for. That coordinator would then need to fan out the >>>> read to all the nodes owning the partitions you are looking for. >>>> >>>> Far better to just do it directly on the client. The token aware client >>>> will send each request for a row straight to a node that owns it. With a >>>> separate connection open to each node, this is done in parallel from the >>>> get-go. Fewer hops. Less load on the coordinator. No bottlenecks. And with >>>> a stored procedure, very very little additional overhead to the client, >>>> server, or network. >>>> >>>> -Tupshin >>>> >>>> >>>> On Tue, Feb 25, 2014 at 7:48 PM, Clint Kelly <clint.ke...@gmail.com>wrote: >>>> >>>>> Hi everyone, >>>>> >>>>> Let's say that I have a table that looks like the following: >>>>> >>>>> CREATE TABLE time_series_stuff ( >>>>> key text, >>>>> family text, >>>>> version int, >>>>> val text, >>>>> PRIMARY KEY (key, family, version) >>>>> ) WITH CLUSTERING ORDER BY (family ASC, version DESC) AND >>>>> bloom_filter_fp_chance=0.010000 AND >>>>> caching='KEYS_ONLY' AND >>>>> comment='' AND >>>>> dclocal_read_repair_chance=0.000000 AND >>>>> gc_grace_seconds=864000 AND >>>>> index_interval=128 AND >>>>> read_repair_chance=0.100000 AND >>>>> replicate_on_write='true' AND >>>>> populate_io_cache_on_flush='false' AND >>>>> default_time_to_live=0 AND >>>>> speculative_retry='99.0PERCENTILE' AND >>>>> memtable_flush_period_in_ms=0 AND >>>>> compaction={'class': 'SizeTieredCompactionStrategy'} AND >>>>> compression={'sstable_compression': 'LZ4Compressor'}; >>>>> >>>>> cqlsh:fiddle> select * from time_series_stuff ; >>>>> >>>>> key | family | version | val >>>>> --------+---------+---------+-------- >>>>> monday | revenue | 3 | $$$$$$ >>>>> monday | revenue | 2 | $$$ >>>>> monday | revenue | 1 | $$ >>>>> monday | revenue | 0 | $ >>>>> monday | traffic | 2 | medium >>>>> monday | traffic | 1 | light >>>>> monday | traffic | 0 | heavy >>>>> >>>>> (7 rows) >>>>> >>>>> Now let's say that I'd like to perform a query that gets me the most >>>>> recent N versions of "revenue" and "traffic." >>>>> >>>>> Is there a CQL query to do this? Let's say that N=1. Then I know >>>>> that I can do: >>>>> >>>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and >>>>> family='revenue' limit 1; >>>>> >>>>> key | family | version | val >>>>> --------+---------+---------+-------- >>>>> monday | revenue | 3 | $$$$$$ >>>>> >>>>> (1 rows) >>>>> >>>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and >>>>> family='traffic' limit 1; >>>>> >>>>> key | family | version | val >>>>> --------+---------+---------+-------- >>>>> monday | traffic | 2 | medium >>>>> >>>>> (1 rows) >>>>> >>>>> But what if I have lots of "families" and I want to get the most >>>>> recent N versions of all of them in a single CQL statement. Is that >>>>> possible? Unfortunately I am working on something where the family names >>>>> and the number of most-recent versions are not known a priori (I am >>>>> porting >>>>> some code that was designed for HBase). >>>>> >>>>> Best regards, >>>>> Clint >>>>> >>>> >>>> >>> >> >