Hi Jack,

I think you missed the point of my email, which was trying to avoid the problem of having very wide rows :) In the sensorId-datetime notation, the datetime is a datetime bucket, say a day. The CQL rows would still be keyed by the actual time of the event. So you'd end up having SensorId -> Datetime Bucket (day/week/month) -> actual event. What I wanted to be able to do was to colocate all the events related to a sensor id on a single node (token).
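To make it concrete, the layout I have in mind would look something like this (table and column names are just for illustration):

CREATE TABLE sensor_events (
    sensorId   timeuuid,
    bucket     text,       // datetime bucket, e.g. '2014-08-29' for a daily bucket
    eventTime  timestamp,  // actual time of the event
    value      blob,
    PRIMARY KEY ((sensorId, bucket), eventTime)
);

Each (sensorId, bucket) pair becomes its own partition, so no single row grows too wide, but today the partitioner hashes the whole (sensorId, bucket) composite, so a sensor's buckets end up scattered across the cluster. What I'm after is hashing on sensorId alone, so that all of a sensor's buckets land on the same token.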
See "High Throughput Timelines” at http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra - Drew On Aug 29, 2014, at 3:58 PM, Jack Krupansky <j...@basetechnology.com> wrote: > With CQL3, you, the developer, get to decide whether to place a primary key > column in the partition key or as a clustering column. So, make sensorID the > partition key and datetime as a clustering column. > > -- Jack Krupansky > > From: Drew Kutcharian > Sent: Friday, August 29, 2014 6:48 PM > To: user@cassandra.apache.org > Subject: Data partitioning and composite partition key > > Hey Guys, > > AFAIK, currently Cassandra partitions (thrift) rows using the row key, > basically uses the hash(row_key) to decide what node that row needs to be > stored on. Now there are times when there is a need to shard a wide row, say > storing events per sensor, so you’d have sensorId-datetime row key so you > don’t end up with very large rows. Is there a way to have the partitioner use > only the “sensorId” part of the row key for the hash? This way we would be > able to store all the data relating to a sensor in one node. > > Another use case of this would be multi-tenancy: > > Say we have accounts and accounts have users. So we would have the following > tables: > > CREATE TABLE account ( > id timeuuid PRIMARY KEY, > company text //timezone > ); > > CREATE TABLE user ( > id timeuuid PRIMARY KEY, > accountId timeuuid, > email text, > password text > ); > > // Get users by account > CREATE TABLE user_account_index ( > accountId timeuuid, > userId timeuuid, > PRIMARY KEY(acid, id) > ); > > Say I want to get all the users that belong to an account. I would first have > to get the results from user_account_index and then use a multi-get (WHERE > IN) to get the records from user table. Now this multi-get part could > potentially query a lot of different nodes in the cluster. It’d be great if > there was a way to limit storage of users of an account to a single node so > that way multi-get would only need to query a single node. > > Note that the problem cannot be simply fixed by using (accountId, id) as the > primary key for the user table since that would create a problem of having a > very large number of (thrift) rows in the users table. > > I did look thru the code and JIRA and I couldn’t really find a solution. The > closest I got was to have a custom partitioner, but then you can’t have a > partitioner per keyspace and that’s not even something that’d be implemented > in future based on the following JIRA: > https://issues.apache.org/jira/browse/CASSANDRA-295 > > Any ideas are much appreciated. > > Best, > > Drew