Pretty much, yes. Although I think it’d be nice if Cassandra handled such a 
case, I’ve resigned myself to the fact that it cannot at the moment. The 
workaround will be to partition on the least-significant byte of the id 
(giving 256 rows spread amongst my nodes), which leaves room for scaling, and 
then cluster each row on geohash or something else.
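A minimal sketch of that workaround in Python (the function name and ids are illustrative, not from any real schema): the low byte of a numeric id gives 256 evenly spread buckets to use as the partition key, while clustering within each bucket is left to another column such as geohash.

```python
# Derive a partition bucket from the least-significant byte of a numeric id.
# 256 buckets get spread across nodes by the partitioner; within a bucket,
# rows can then be clustered on geohash or another column.

def partition_bucket(record_id: int) -> int:
    """Return the low byte of the id as the bucket number (0-255)."""
    return record_id & 0xFF

# Sequential ids land in different buckets, so writes spread out:
print(partition_bucket(1000))  # 232
print(partition_bucket(1001))  # 233
```

Since the bucket count is fixed at 256 rather than tied to the current node count, adding or removing nodes doesn’t change which bucket a record belongs to.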

Basically this desire all stems from wanting efficient use of memory. 
Frequently accessed keys’ values are kept in RAM through the OS page cache, 
but the page size is 4 KB. This is a problem if you are accessing several 
small records of data (say 200 bytes each), since each record occupies only a 
small percentage of a page. That’s why it’s important to increase the 
probability that neighboring data on disk is relevant. The worst case is 
reading a full 4 KB page into RAM only to access a single record of a couple 
hundred bytes; all of the page’s other, unused data wastefully occupies RAM. 
Now project this problem onto a collection of millions of small records, all 
indiscriminately and randomly scattered on disk, and you can easily see how 
inefficient your memory usage becomes.
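The arithmetic behind that worst case is easy to make concrete (record size of 200 bytes taken from the example above):

```python
PAGE_SIZE = 4096    # OS page cache granularity, in bytes
RECORD_SIZE = 200   # one small record, per the example

# Worst case: only one useful record in a cached page.
utilization = RECORD_SIZE / PAGE_SIZE   # fraction of the page that's useful
wasted = PAGE_SIZE - RECORD_SIZE        # bytes of RAM wasted per page

# Best case: the page is packed with records that are all relevant.
records_per_page = PAGE_SIZE // RECORD_SIZE

print(f"{utilization:.1%} useful, {wasted} bytes wasted")  # 4.9% useful, 3896 bytes wasted
print(records_per_page)                                    # 20
```

So well-clustered data can hold roughly 20 relevant records per cached page instead of one, a ~20x difference in effective RAM usage.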

That’s why it’s best to cluster data in some meaningful way, in an effort to 
increase the probability that when one record in a 4 KB block is accessed, its 
neighboring records will also be accessed. This brings me back to the question 
of this thread: I want to randomly distribute the data amongst the nodes to 
avoid hot spotting, but within each node I want to cluster the data 
meaningfully, so that the probability that neighboring data is relevant is 
increased.

An example of this would be a huge collection of small records storing basic 
user information. If you partition on the unique user id, you’ll get nice 
random distribution but no ability to cluster (each record would occupy its 
own row). You could partition on, say, geographical region, but then you’ll 
end up with hot spotting when one region is more active than another. So 
ideally you’d randomly assign a node to each record to increase parallelism, 
but then cluster all records on a node by, say, geohash, since it is more 
likely (however slightly) that when one user from a geographical region is 
accessed, other users from the same region will also need to be accessed. 
It’s certainly better than having some random user record next to the one you 
are accessing at the moment.
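The combined layout can be simulated in a few lines of Python. The user ids and geohashes below are made up for illustration: the low byte of the id spreads users pseudo-randomly across buckets, while sorting within a bucket by geohash keeps geographically nearby users adjacent.

```python
from collections import defaultdict

# Hypothetical user records: (user_id, geohash of the user's location).
users = [
    (1257, "9q8yz"),   # geohashes starting "9q8y" are near each other
    (1001, "9q8yy"),
    (2049, "dr5ru"),   # a distant region
]

# Partition: low byte of the id picks one of 256 buckets.
buckets = defaultdict(list)
for user_id, geohash in users:
    buckets[user_id & 0xFF].append((geohash, user_id))

# Cluster: within each bucket, order rows by geohash (then id).
for rows in buckets.values():
    rows.sort()

# Users 1001 and 1257 share bucket 233 and end up adjacent, sorted by
# geohash; user 2049 lands in bucket 1 on (likely) a different node.
print(sorted(buckets))  # [1, 233]
```

This mirrors the proposed schema: bucket as partition key for distribution, geohash as clustering column for on-disk locality.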




On Dec 3, 2013, at 11:32 PM, Vivek Mishra <mishra.v...@gmail.com> wrote:

> So Basically you want to create a cluster of multiple unique keys, but data 
> which belongs to one unique should be colocated. correct?
> 
> -Vivek
> 
> 
> On Tue, Dec 3, 2013 at 10:39 AM, onlinespending <onlinespend...@gmail.com> 
> wrote:
> Subject says it all. I want to be able to randomly distribute a large set of 
> records but keep them clustered in one wide row per node.
> 
> As an example, lets say I’ve got a collection of about 1 million records each 
> with a unique id. If I just go ahead and set the primary key (and therefore 
> the partition key) as the unique id, I’ll get very good random distribution 
> across my server cluster. However, each record will be its own row. I’d like 
> to have each record belong to one large wide row (per server node) so I can 
> have them sorted or clustered on some other column.
> 
> If I say have 5 nodes in my cluster, I could randomly assign a value of 1 - 5 
> at the time of creation and have the partition key set to this value. But 
> this becomes troublesome if I add or remove nodes. What effectively I want is 
> to partition on the unique id of the record modulus N (id % N; where N is the 
> number of nodes).
> 
> I have to imagine there’s a mechanism in Cassandra to simply randomize the 
> partitioning without even using a key (and then clustering on some column).
> 
> Thanks for any help.
> 
