Mainly lower latency and network overhead in multi-get requests (WHERE IN (…)). The coordinator needs to connect to only one node vs. potentially every node in the cluster.
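As a minimal sketch of the read pattern in question, assuming the user and user_account_index tables from the thread below (the ? marks are bind parameters, not actual values):

-- Step 1: look up the user ids for an account; this hits a single partition,
-- so the coordinator talks to one replica set.
SELECT userId FROM user_account_index WHERE accountId = ?;

-- Step 2: multi-get the user rows. Each id hashes to its own partition, so
-- this one statement can fan out to many nodes in the cluster.
SELECT * FROM user WHERE id IN (?, ?, ?);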
On Aug 29, 2014, at 5:23 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Okay, but what benefit do you think you get from having the partitions on
> the same node – since they would be separate partitions anyway? I mean,
> what exactly do you think you’re going to do with them that wouldn’t be a
> whole lot more performant by being able to process data in parallel from
> separate nodes? I mean, the whole point of Cassandra is scalability and
> distributed processing, right?
>
> -- Jack Krupansky
>
> From: Drew Kutcharian
> Sent: Friday, August 29, 2014 7:31 PM
> To: user@cassandra.apache.org
> Subject: Re: Data partitioning and composite partition key
>
> Hi Jack,
>
> I think you missed the point of my email, which was about avoiding the
> problem of having very wide rows :) In the notation sensorId-datetime, the
> datetime is a datetime bucket, say a day. The CQL rows would still be keyed
> by the actual time of the event. So you’d end up with
> SensorId -> Datetime Bucket (day/week/month) -> actual event. What I wanted
> was to colocate all the events related to a sensor id on a single node
> (token).
>
> See "High Throughput Timelines" at
> http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
>
> - Drew
>
>
> On Aug 29, 2014, at 3:58 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> With CQL3, you, the developer, get to decide whether to place a primary
>> key column in the partition key or as a clustering column. So, make
>> sensorId the partition key and datetime a clustering column.
>>
>> -- Jack Krupansky
>>
>> From: Drew Kutcharian
>> Sent: Friday, August 29, 2014 6:48 PM
>> To: user@cassandra.apache.org
>> Subject: Data partitioning and composite partition key
>>
>> Hey Guys,
>>
>> AFAIK, Cassandra currently partitions (thrift) rows using the row key,
>> i.e. it uses hash(row_key) to decide which node a row is stored on. Now
>> there are times when there is a need to shard a wide row, say when storing
>> events per sensor, so you’d use a sensorId-datetime row key so you don’t
>> end up with very large rows. Is there a way to have the partitioner use
>> only the “sensorId” part of the row key for the hash? That way we would be
>> able to store all the data relating to a sensor on one node.
>>
>> Another use case for this would be multi-tenancy:
>>
>> Say we have accounts, and accounts have users, so we would have the
>> following tables:
>>
>> CREATE TABLE account (
>>   id timeuuid PRIMARY KEY,
>>   company text // timezone
>> );
>>
>> CREATE TABLE user (
>>   id timeuuid PRIMARY KEY,
>>   accountId timeuuid,
>>   email text,
>>   password text
>> );
>>
>> // Get users by account
>> CREATE TABLE user_account_index (
>>   accountId timeuuid,
>>   userId timeuuid,
>>   PRIMARY KEY (accountId, userId)
>> );
>>
>> Say I want to get all the users that belong to an account. I would first
>> have to get the results from user_account_index and then use a multi-get
>> (WHERE IN) to fetch the records from the user table. That multi-get could
>> potentially query a lot of different nodes in the cluster. It’d be great
>> if there were a way to limit storage of an account’s users to a single
>> node, so the multi-get would only need to query one node.
>>
>> Note that the problem cannot simply be fixed by using (accountId, id) as
>> the primary key for the user table, since that would create the problem of
>> having a very large number of (thrift) rows in the user table.
>>
>> I did look through the code and JIRA and I couldn’t really find a
>> solution. The closest I got was a custom partitioner, but you can’t have a
>> partitioner per keyspace, and that isn’t something that will be
>> implemented in the future, based on the following JIRA:
>> https://issues.apache.org/jira/browse/CASSANDRA-295
>>
>> Any ideas are much appreciated.
>>
>> Best,
>>
>> Drew
>
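For reference, a minimal CQL sketch of the two time-series layouts discussed above; the event_time, value, and day_bucket columns and their types are illustrative assumptions, not something from the thread:

-- Jack's suggestion: sensorId alone is the partition key and the event time
-- is a clustering column, so every event for a sensor lives in one partition
-- (one replica set), but that partition grows without bound.
CREATE TABLE sensor_events (
  sensorId timeuuid,
  event_time timestamp,
  value double,            -- illustrative payload column
  PRIMARY KEY (sensorId, event_time)
);

-- Drew's bucketed layout: the time bucket is part of a composite partition
-- key, which caps partition size, but each (sensorId, day_bucket) pair hashes
-- independently, so a sensor's buckets are spread across the cluster. That is
-- exactly the behaviour the thread is trying to avoid.
CREATE TABLE sensor_events_bucketed (
  sensorId timeuuid,
  day_bucket text,         -- e.g. '2014-08-29'
  event_time timestamp,
  value double,
  PRIMARY KEY ((sensorId, day_bucket), event_time)
);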