Hey guys,

AFAIK, Cassandra currently partitions (thrift) rows by row key: it uses hash(row_key) to decide which node a row is stored on. There are times when you need to shard a wide row. Say you store events per sensor; you'd use a sensorId-datetime row key so you don't end up with very large rows. Is there a way to have the partitioner use only the “sensorId” part of the row key for the hash? That way all the data relating to a sensor would be stored on one node.
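To make the sensor case concrete, here is a rough Python sketch of the placement behavior being asked about. It is not Cassandra code: the function names, the "-" separator, the MD5 stand-in hash, and the simple modulo placement (instead of a real token ring) are all assumptions made for illustration.

```python
import hashlib


def token(value: str) -> int:
    """Stand-in for a partitioner hash (MD5 here, purely illustrative)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


def node_for_full_key(row_key: str, num_nodes: int) -> int:
    # Behavior described in the question: the whole row key is hashed,
    # so sensorId-datetime keys for one sensor can land on different nodes.
    return token(row_key) % num_nodes


def node_for_prefix(row_key: str, num_nodes: int, sep: str = "-") -> int:
    # Desired behavior: hash only the part before the first separator
    # (the sensorId), so every shard of a sensor maps to the same node.
    # Assumes the sensorId itself never contains the separator.
    prefix = row_key.split(sep, 1)[0]
    return token(prefix) % num_nodes


if __name__ == "__main__":
    keys = ["sensor42-2013-03-01", "sensor42-2013-03-02", "sensor42-2013-03-03"]
    print("full key :", [node_for_full_key(k, 8) for k in keys])
    print("prefix   :", [node_for_prefix(k, 8) for k in keys])
```

With the prefix variant, every key that starts with the same sensorId resolves to the same node index, which is the co-location property the question is after.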
Another use case for this is multi-tenancy. Say we have accounts, and accounts have users. We would have the following tables:

    CREATE TABLE account (
        id timeuuid PRIMARY KEY,
        company text
        //timezone
    );

    CREATE TABLE user (
        id timeuuid PRIMARY KEY,
        accountId timeuuid,
        email text,
        password text
    );

    // Get users by account
    CREATE TABLE user_account_index (
        accountId timeuuid,
        userId timeuuid,
        PRIMARY KEY (accountId, userId)
    );

Say I want to get all the users that belong to an account. I would first have to get the results from user_account_index and then use a multi-get (WHERE ... IN) to fetch the records from the user table. That multi-get could potentially query a lot of different nodes in the cluster. It'd be great if there was a way to limit storage of an account's users to a single node, so that the multi-get would only need to query that one node.

Note that the problem cannot simply be fixed by using (accountId, id) as the primary key for the user table, since that would create a problem of having a very large number of (thrift) rows in the users table.

I did look through the code and JIRA and couldn't really find a solution. The closest I got was a custom partitioner, but you can't have a partitioner per keyspace, and based on the following JIRA that isn't something that will be implemented in the future either:

https://issues.apache.org/jira/browse/CASSANDRA-295

Any ideas are much appreciated.

Best,
Drew