Pretty much, yes. Although I think it’d be nice if Cassandra handled such a case, I’ve resigned myself to the fact that it cannot at the moment. The workaround will be to partition on the least significant byte (LSB) of the id, giving 256 rows spread amongst my nodes, which leaves room for scaling, and then to cluster each row on geohash or some other column.
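To make that workaround concrete, here’s a minimal sketch in Python (with hypothetical names) of deriving the 256-way bucket from the id’s least significant byte; the bucket would serve as the partition key, with geohash as a clustering column:

```python
def partition_bucket(user_id: int) -> int:
    """Derive the partition key from the least significant byte of the id.

    This yields 256 stable buckets (0-255) regardless of how many nodes
    are in the cluster, so adding or removing nodes never changes which
    bucket a record belongs to.
    """
    return user_id & 0xFF

# The corresponding (hypothetical) table would then be keyed roughly as:
#   PRIMARY KEY ((bucket), geohash, user_id)
# i.e. partition on bucket, cluster each partition by geohash.
```

Since 256 buckets is far more than the node count, the partitioner can still spread them evenly, and the bucket assignment never changes as the cluster grows.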
Basically this desire all stems from wanting efficient use of memory. Frequently accessed keys’ values are kept in RAM through the OS page cache, but the page size is 4 KB. This is a problem if you are accessing several small records (say 200 bytes each), since each record occupies only a small percentage of a page. That is why it’s important to increase the probability that neighboring data on disk is relevant. Worst case, you read a full 4 KB page into RAM only to access one record of a couple hundred bytes; all of the other unused data in that page wastefully occupies RAM. Now project this problem onto a collection of millions of small records indiscriminately and randomly scattered across the disk, and you can easily see how inefficient your memory usage becomes. That’s why it’s best to cluster data in some meaningful way, all in an effort to increase the probability that when one record in a 4 KB page is accessed, its neighboring records will also be accessed.

This brings me back to the question of this thread. I want to randomly distribute the data amongst the nodes to avoid hot spotting, but within each node I want to cluster the data meaningfully, so that the probability that neighboring data is relevant is increased.

As an example, take a huge collection of small records that store basic user information. If you partition on the unique user id, you’ll get nice random distribution but no ability to cluster (each record would occupy its own row). You could partition on, say, geographical region, but then you’ll end up with hot spotting when one region is more active than another. So ideally you’d like to randomly assign a node to each record to increase parallelism, but then cluster all records on a node by, say, geohash, since it is more likely (however small that likelihood may be) that when one user from a geographical region is accessed, other users from the same region will also need to be accessed.
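To put rough numbers on the waste described above, here’s a back-of-the-envelope sketch using the figures from the example (a 200-byte record and a 4 KB page; the function name is my own):

```python
PAGE_SIZE = 4096     # typical OS page size in bytes
RECORD_SIZE = 200    # small record from the example above

def page_utilization(records_accessed: int) -> float:
    """Fraction of a 4 KB page occupied by data we actually wanted."""
    useful = min(records_accessed * RECORD_SIZE, PAGE_SIZE)
    return useful / PAGE_SIZE

# One isolated record: only ~5% of the page is useful data.
lone = page_utilization(1)
# Clustered data where all ~20 records in the page are relevant: ~98%.
packed = page_utilization(PAGE_SIZE // RECORD_SIZE)
```

The roughly 20x difference in RAM efficiency between the two cases is the whole motivation for clustering related records next to each other on disk.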
It’s certainly better than having some random user record next to the one you are accessing at the moment.

On Dec 3, 2013, at 11:32 PM, Vivek Mishra <mishra.v...@gmail.com> wrote:

> So basically you want to create a cluster of multiple unique keys, but data
> which belongs to one unique key should be colocated. Correct?
>
> -Vivek
>
> On Tue, Dec 3, 2013 at 10:39 AM, onlinespending <onlinespend...@gmail.com> wrote:
> Subject says it all. I want to be able to randomly distribute a large set of
> records but keep them clustered in one wide row per node.
>
> As an example, let’s say I’ve got a collection of about 1 million records, each
> with a unique id. If I just go ahead and set the primary key (and therefore
> the partition key) as the unique id, I’ll get very good random distribution
> across my server cluster. However, each record will be its own row. I’d like
> to have each record belong to one large wide row (per server node) so I can
> have them sorted or clustered on some other column.
>
> If, say, I have 5 nodes in my cluster, I could randomly assign a value of 1 - 5
> at the time of creation and set the partition key to this value. But this
> becomes troublesome if I add or remove nodes. Effectively, what I want is to
> partition on the unique id of the record modulo N (id % N, where N is the
> number of nodes).
>
> I have to imagine there’s a mechanism in Cassandra to simply randomize the
> partitioning without even using a key (and then cluster on some column).
>
> Thanks for any help.
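The trouble with id % N that the original question anticipates can be shown with a small sketch: when the node count N changes, the vast majority of keys land in a different bucket, whereas a fixed bucket count (like 256 buckets from the id’s LSB) is unaffected by cluster size. The function name here is my own, for illustration:

```python
def remap_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose (id % N) bucket changes when N changes."""
    moved = sum(1 for k in keys if k % old_n != k % new_n)
    return moved / len(keys)

keys = range(100_000)
# Going from 5 to 6 nodes moves roughly 83% of all keys to a new bucket,
# which would mean massive data movement on every cluster resize.
frac = remap_fraction(keys, 5, 6)
# A fixed 256-bucket scheme (id & 0xFF) never moves any keys, since the
# bucket count is independent of the node count.
```

This is essentially the same reason consistent-hashing schemes avoid simple modulo partitioning.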