Hi Jack,

I think you missed the point of my email, which was trying to avoid the problem of having very wide rows :) In the sensorId-datetime notation, the datetime is a datetime bucket, say a day. The CQL rows would still be keyed by the actual time of the event. So you'd end up having SensorId -> Datetime Bucket (day/week/month) -> actual event. What I wanted to be able to do was to colocate all the events related to a sensor id on a single node (token).
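To make it concrete, the layout I have in mind would look something like this (table and column names are just for illustration):

CREATE TABLE sensor_events (
    sensorId   timeuuid,
    bucket     text,       // datetime bucket, e.g. '2014-08-29' for a daily bucket
    eventTime  timestamp,  // actual time of the event
    value      blob,
    PRIMARY KEY ((sensorId, bucket), eventTime)
);

Each (sensorId, bucket) pair becomes its own partition, so no single row grows too wide, but today the partitioner hashes the whole (sensorId, bucket) composite, so a sensor's buckets end up scattered across the cluster. What I'm after is hashing on sensorId alone, so that all of a sensor's buckets land on the same token.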
See "High Throughput Timelines” at http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra - Drew On Aug 29, 2014, at 3:58 PM, Jack Krupansky <j...@basetechnology.com> wrote: > With CQL3, you, the developer, get to decide whether to place a primary key > column in the partition key or as a clustering column. So, make sensorID the > partition key and datetime as a clustering column. > > -- Jack Krupansky > > From: Drew Kutcharian > Sent: Friday, August 29, 2014 6:48 PM > To: user@cassandra.apache.org > Subject: Data partitioning and composite partition key > > Hey Guys, > > AFAIK, currently Cassandra partitions (thrift) rows using the row key, > basically uses the hash(row_key) to decide what node that row needs to be > stored on. Now there are times when there is a need to shard a wide row, say > storing events per sensor, so you’d have sensorId-datetime row key so you > don’t end up with very large rows. Is there a way to have the partitioner use > only the “sensorId” part of the row key for the hash? This way we would be > able to store all the data relating to a sensor in one node. > > Another use case of this would be multi-tenancy: > > Say we have accounts and accounts have users. So we would have the following > tables: > > CREATE TABLE account ( > id timeuuid PRIMARY KEY, > company text //timezone > ); > > CREATE TABLE user ( > id timeuuid PRIMARY KEY, > accountId timeuuid, > email text, > password text > ); > > // Get users by account > CREATE TABLE user_account_index ( > accountId timeuuid, > userId timeuuid, > PRIMARY KEY(acid, id) > ); > > Say I want to get all the users that belong to an account. I would first have > to get the results from user_account_index and then use a multi-get (WHERE > IN) to get the records from user table. Now this multi-get part could > potentially query a lot of different nodes in the cluster. It’d be great if > there was a way to limit storage of users of an account to a single node so > that way multi-get would only need to query a single node. > > Note that the problem cannot be simply fixed by using (accountId, id) as the > primary key for the user table since that would create a problem of having a > very large number of (thrift) rows in the users table. > > I did look thru the code and JIRA and I couldn’t really find a solution. The > closest I got was to have a custom partitioner, but then you can’t have a > partitioner per keyspace and that’s not even something that’d be implemented > in future based on the following JIRA: > https://issues.apache.org/jira/browse/CASSANDRA-295 > > Any ideas are much appreciated. > > Best, > > Drew