You need to switch gears to new terminology as well - a "thrift row" is a partition now, etc... :)
So yes - the *partition* key of the *table* would be scopeId, scopeType in my
proposed scheme. But the partitions would be too big, given what you describe.
You could shard the rows, but even then they would be large, and retrieval
with IN on the shards would put a lot of pressure on the cluster and
coordinator. That's what we do to avoid hot spots, but our numbers are much
smaller. Also we never delete, just ttl and compact.

If the type of a node is known at the time its uuid is assigned, you could
embed the type in the uuid, e.g. by taking over either part of the MAC address
or some of the random bits in a uuid v1. This would greatly simplify the
problem (presuming the types have low cardinality). E.g.:

CREATE TABLE IF NOT EXISTS Graph_Marked_Nodes (
    scopeId uuid,
    scopeType varchar,
    nodeIdType timeuuid,
    timestamp bigint,
    PRIMARY KEY ((scopeId, scopeType, nodeIdType))
);

SELECT timestamp
FROM Graph_Marked_Nodes
WHERE scopeId = ?
  AND scopeType = ?
  AND nodeIdType IN (uuid1foo, uuid2bar, uuid3foo);

A possible similar approach would be to use User Defined Types in 2.1, but I
haven't even looked at that yet.

There are blog posts from Datastax describing internal structures - and then
there is the source of course :)

ml

On Sun, Aug 31, 2014 at 11:06 AM, Todd Nine <todd.n...@gmail.com> wrote:

> Hey Michael,
> Thanks for the response. If I use the clustered columns in the way you
> described, won't that make the row key of the column family scopeId and
> scopeType?
>
> The scope fields represent a graph's owner. The graph itself can have
> several billion nodes in it. When a lot of deletes start occurring on the
> same graph, I will quickly saturate the row capacity of a column family if
> the physical row key is only the scope.
>
> This is why I have each node on its own row key. As long as our cluster
> has the capacity to handle the load, we won't hit the upper bounds of the
> maximum columns in a row.
>
> I'm new to CQL in our code.
> I've only been using it for administration. I've been using the thrift
> interface in code since the 0.6 days.
>
> I feel I have a strong understanding of the internals of the column family
> structure. I'm struggling to find documentation on the CQL-to-physical
> layout that isn't a trivial example, especially around multiget use cases.
> Do you have any pointers to blogs or tutorials you've found helpful?
>
> Thanks,
> Todd
>
>
> On Sunday, August 31, 2014, Laing, Michael <michael.l...@nytimes.com>
> wrote:
>
>> Actually I think you do want to use scopeId, scopeType as the partition
>> key (and drop row caching until you upgrade to 2.1, where "rows" are in
>> fact rows and not partitions):
>>
>> CREATE TABLE IF NOT EXISTS Graph_Marked_Nodes (
>>     scopeId uuid,
>>     scopeType varchar,
>>     nodeId uuid,
>>     nodeType varchar,
>>     timestamp bigint,
>>     PRIMARY KEY ((scopeId, scopeType), nodeId, nodeType)
>> );
>>
>> Then you can select using IN on the cartesian product of your clustering
>> keys:
>>
>> SELECT timestamp
>> FROM Graph_Marked_Nodes
>> WHERE scopeId = ?
>>   AND scopeType = ?
>>   AND (nodeId, nodeType) IN (
>>     (uuid1, 'foo'), (uuid1, 'bar'),
>>     (uuid2, 'foo'), (uuid2, 'bar'),
>>     (uuid3, 'foo'), (uuid3, 'bar')
>>   );
>>
>> ml
>>
>> PS Of course you could boldly go to 2.1 now for a nice performance
>> boost :)
>>
>>
>> On Sat, Aug 30, 2014 at 8:59 PM, Todd Nine <toddn...@apache.org> wrote:
>>
>>> Hi all,
>>> I'm working on transferring our thrift DAOs over to CQL. It's going
>>> well, except for 2 cases that both use multiget. The use case is very
>>> simple. It is a narrow row, by design, with only a few columns. When I
>>> perform a multiget, I need to get up to 1k rows at a time. I do not
>>> want to turn these into a wide row using scopeId and scopeType as the
>>> row key.
>>>
>>> On the physical level, my Column Family needs something similar to the
>>> following format.
>>>
>>> scopeId, scopeType, nodeId, nodeType : { timestamp: 0x00 }
>>>
>>> I've defined my table with the following CQL:
>>>
>>> CREATE TABLE IF NOT EXISTS Graph_Marked_Nodes (
>>>     scopeId uuid,
>>>     scopeType varchar,
>>>     nodeId uuid,
>>>     nodeType varchar,
>>>     timestamp bigint,
>>>     PRIMARY KEY ((scopeId, scopeType, nodeId, nodeType))
>>> ) WITH caching = 'all';
>>>
>>> This works well for inserts, deletes, and single reads. I always know
>>> the scopeId, scopeType, nodeId, and nodeType, so I want to return the
>>> timestamp columns. I thought I could use the IN operation and specify
>>> the pairs of nodeIds and nodeTypes I have as input, however this
>>> doesn't work.
>>>
>>> Can anyone give me a suggestion on how to perform a multiget when I
>>> have several values for the nodeId and the nodeType? This read occurs
>>> on every read of edges, so making 1k trips is not going to work from a
>>> performance perspective.
>>>
>>> Below is the query I've tried:
>>>
>>> SELECT timestamp FROM Graph_Marked_Nodes
>>> WHERE scopeId = ? AND scopeType = ?
>>>   AND nodeId IN (uuid1, uuid2, uuid3)
>>>   AND nodeType IN ('foo', 'bar');
>>>
>>> I've found this issue, which looks like it's a solution to my problem:
>>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-6875
>>>
>>> However, I'm not able to get the syntax in the issue description to
>>> work either. Any input would be appreciated!
>>>
>>> Cassandra: 2.0.10
>>> Datastax Driver: 2.1.0
>>>
>>> Thanks,
>>> Todd
>>
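[Editor's note: the "embed the type in the uuid" suggestion from the top reply can be sketched in client code. This is a minimal, purely illustrative Python sketch, not anything from the thread: it assumes an 8-bit type tag stolen from the low bits of a v1 uuid's 48-bit node field, and a small hand-rolled type registry (`TYPE_TAGS`).]

```python
import uuid

# Assumed: types have low cardinality, so an 8-bit tag suffices.
TYPE_BITS = 8
TYPE_MASK = (1 << TYPE_BITS) - 1

# Hypothetical type registry for this sketch.
TYPE_TAGS = {"foo": 1, "bar": 2}
TAG_TYPES = {v: k for k, v in TYPE_TAGS.items()}

def make_node_id(node_type: str) -> uuid.UUID:
    """Generate a time-based (v1) uuid whose low node bits carry the type tag."""
    u = uuid.uuid1()
    # Overwrite the low bits of the 48-bit node (MAC) field with the tag.
    node = (u.node & ~TYPE_MASK) | TYPE_TAGS[node_type]
    return uuid.UUID(fields=u.fields[:5] + (node,))

def node_type_of(u: uuid.UUID) -> str:
    """Recover the type from a uuid built by make_node_id."""
    return TAG_TYPES[u.node & TYPE_MASK]

nid = make_node_id("foo")
print(node_type_of(nid))  # -> foo
```

With ids built this way, the single `nodeIdType timeuuid` column in the proposed table carries both identity and type, so the multiget becomes a plain single-column IN.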
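[Editor's note: the cartesian-product IN query in the second reply needs the explicit (nodeId, nodeType) pairs built client-side. A minimal Python sketch, with illustrative placeholder values; whether bind markers are accepted inside a multi-column IN depends on the Cassandra/driver versions, per CASSANDRA-6875.]

```python
from itertools import product

# Illustrative inputs; in real code these would be uuids and type names.
node_ids = ["uuid1", "uuid2", "uuid3"]
node_types = ["foo", "bar"]

# Every (nodeId, nodeType) combination, in the order shown in the reply.
pairs = list(product(node_ids, node_types))

# One (?, ?) tuple placeholder per pair, then flatten the bind values.
placeholders = ", ".join("(?, ?)" for _ in pairs)
query = (
    "SELECT timestamp FROM Graph_Marked_Nodes "
    "WHERE scopeId = ? AND scopeType = ? "
    "AND (nodeId, nodeType) IN (" + placeholders + ")"
)
params = [v for pair in pairs for v in pair]
print(len(pairs))  # -> 6
```

For 3 ids and 2 types this yields 6 tuples, so the 1k-row multiget collapses into a single statement instead of 1k round trips.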