Hi Tupshin, Thanks for your help once again, I really appreciate it. Quick question regarding the issue of token-aware routing, etc. Let's say that I am using the table described earlier:
CREATE TABLE time_series_stuff ( key text, family text, version int, val text, PRIMARY KEY (key, family, version) ) WITH CLUSTERING ORDER BY (family ASC, version DESC) I want to retrieve values for the most-recent version of every family for a given key, doing something like: SELECT * from time_series_stuff where key='mykey' but getting only one version per family. All of this data should live on the same node (or set of replica nodes), correct? I am specifying the partition key here, and I thought that only the partition key determined on what physical nodes data exists. Therefore, I would think that all of the results from this query would come from a single replica node (or set of replica nodes, if the consistency level is greater than 1). Would you mind clarifying? Thanks a lot! Best regards, Clint On Wed, Feb 26, 2014 at 4:56 AM, Tupshin Harper <tups...@tupshin.com> wrote: > And one last clarification. Where I said "stored procedure" earlier, I > meant "prepared statement". Sorry for the confusion. Too much typing while > tired. > > -Tupshin > > > On Tue, Feb 25, 2014 at 10:36 PM, Tupshin Harper <tups...@tupshin.com>wrote: > >> I failed to address the matter of not knowing the families in advance. >> >> I can't really recommend any solution to that other than storing the list >> of families in another structure that is readily queryable. I don't know >> how many families you are thinking, but if it is in the millions or more, >> You might consider constructing another table such as: >> CREATE TABLE families ( >> key int, >> family text, >> PRIMARY KEY (key, family) >> ); >> >> >> store your families there, with a knowable set of keys (I suggest >> something like the last 3 digits of the md5 hash of the family). So then >> you could retrieve your families in nice sized batches >> SELECT family FROM id WHERE key=0; >> and then do the fan-out selects that I described previously. >> >> -Tupshin >> >> >> On Tue, Feb 25, 2014 at 10:15 PM, Tupshin Harper <tups...@tupshin.com>wrote: >> >>> Hi Clint, >>> >>> What you are describing could actually be accomplished with the Thrift >>> API and a multiget_slice with a slicerange having a count of 1. Initially I >>> was thinking that this was an important feature gap between Thrift and CQL, >>> and was going to suggest that it should be implemented (possible syntax is >>> in https://issues.apache.org/jira/browse/CASSANDRA-6167 which is almost >>> a superset of this feature). >>> >>> But then I was convinced by some colleagues, that with a modern CQL >>> driver that is token aware, you are actually better off (in terms of >>> latency, throughput, and reliability), by doing each query separately on >>> the client. >>> >>> The reasoning is that if you did this with a single query, it would >>> necessarily be sent to a coordinator that wouldn't own most of the data >>> that you are looking for. That coordinator would then need to fan out the >>> read to all the nodes owning the partitions you are looking for. >>> >>> Far better to just do it directly on the client. The token aware client >>> will send each request for a row straight to a node that owns it. With a >>> separate connection open to each node, this is done in parallel from the >>> get-go. Fewer hops. Less load on the coordinator. No bottlenecks. And with >>> a stored procedure, very very little additional overhead to the client, >>> server, or network. >>> >>> -Tupshin >>> >>> >>> On Tue, Feb 25, 2014 at 7:48 PM, Clint Kelly <clint.ke...@gmail.com>wrote: >>> >>>> Hi everyone, >>>> >>>> Let's say that I have a table that looks like the following: >>>> >>>> CREATE TABLE time_series_stuff ( >>>> key text, >>>> family text, >>>> version int, >>>> val text, >>>> PRIMARY KEY (key, family, version) >>>> ) WITH CLUSTERING ORDER BY (family ASC, version DESC) AND >>>> bloom_filter_fp_chance=0.010000 AND >>>> caching='KEYS_ONLY' AND >>>> comment='' AND >>>> dclocal_read_repair_chance=0.000000 AND >>>> gc_grace_seconds=864000 AND >>>> index_interval=128 AND >>>> read_repair_chance=0.100000 AND >>>> replicate_on_write='true' AND >>>> populate_io_cache_on_flush='false' AND >>>> default_time_to_live=0 AND >>>> speculative_retry='99.0PERCENTILE' AND >>>> memtable_flush_period_in_ms=0 AND >>>> compaction={'class': 'SizeTieredCompactionStrategy'} AND >>>> compression={'sstable_compression': 'LZ4Compressor'}; >>>> >>>> cqlsh:fiddle> select * from time_series_stuff ; >>>> >>>> key | family | version | val >>>> --------+---------+---------+-------- >>>> monday | revenue | 3 | $$$$$$ >>>> monday | revenue | 2 | $$$ >>>> monday | revenue | 1 | $$ >>>> monday | revenue | 0 | $ >>>> monday | traffic | 2 | medium >>>> monday | traffic | 1 | light >>>> monday | traffic | 0 | heavy >>>> >>>> (7 rows) >>>> >>>> Now let's say that I'd like to perform a query that gets me the most >>>> recent N versions of "revenue" and "traffic." >>>> >>>> Is there a CQL query to do this? Let's say that N=1. Then I know that >>>> I can do: >>>> >>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and >>>> family='revenue' limit 1; >>>> >>>> key | family | version | val >>>> --------+---------+---------+-------- >>>> monday | revenue | 3 | $$$$$$ >>>> >>>> (1 rows) >>>> >>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and >>>> family='traffic' limit 1; >>>> >>>> key | family | version | val >>>> --------+---------+---------+-------- >>>> monday | traffic | 2 | medium >>>> >>>> (1 rows) >>>> >>>> But what if I have lots of "families" and I want to get the most recent >>>> N versions of all of them in a single CQL statement. Is that possible? >>>> Unfortunately I am working on something where the family names and the >>>> number of most-recent versions are not known a priori (I am porting some >>>> code that was designed for HBase). >>>> >>>> Best regards, >>>> Clint >>>> >>> >>> >> >