> Basically this desire all stems from wanting efficient use of memory. 
Do you have any real latency numbers you are trying to tune?

Otherwise this sounds a little like premature optimisation.

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 5/12/2013, at 6:16 am, onlinespending <onlinespend...@gmail.com> wrote:

> Pretty much, yes. Although I think it’d be nice if Cassandra handled such a 
> case, I’ve resigned myself to the fact that it cannot at the moment. The 
> workaround will be to partition on the LSB portion of the id (giving 256 rows 
> spread amongst my nodes), which leaves room for scaling, and then cluster each 
> row on geohash or something else.
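> 
> Something along these lines (untested, and the table/column names are just 
> made up for illustration):
> 
>     CREATE TABLE users (
>         bucket   int,     -- low byte of the user id (user_id & 0xFF, computed client-side): 256 partitions
>         geohash  text,    -- clustering column, so nearby users sort together within a partition
>         user_id  bigint,
>         name     text,
>         PRIMARY KEY (bucket, geohash, user_id)
>     );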
> 
> Basically this desire all stems from wanting efficient use of memory. 
> Frequently accessed keys’ values are kept in RAM through the OS page cache, 
> but the page size is 4KB. This is a problem if you are accessing many small 
> records (say 200 bytes each), since each record occupies only a small fraction 
> of a page (roughly 5% of a 4KB page for a 200-byte record). This is why it’s 
> important to increase the probability that neighboring data on disk is 
> relevant. The worst case is reading a full 4KB page into RAM when you only 
> need one record of a couple hundred bytes from it; all of the other unused 
> data in the page wastefully occupies RAM. Now project this problem onto a 
> collection of millions of small records randomly scattered across the disk, 
> and you can easily see how inefficient your memory usage becomes.
> 
> That’s why it’s best to cluster data in some meaningful way, all in an effort 
> to increase the probability that when one record in a 4KB block is accessed, 
> its neighboring records will also be accessed. This brings me back to the 
> question of this thread. I want to randomly distribute the data amongst the 
> nodes to avoid hot spotting, but within each node I want to cluster the data 
> meaningfully so that the probability that neighboring data is relevant is 
> increased.
> 
> An example of this would be a huge collection of small records that store 
> basic user information. If you partition on the unique user id, you get nice 
> random distribution but no ability to cluster (each record occupies its own 
> row). You could partition on, say, geographical region, but then you end up 
> with hot spotting when one region is more active than another. So ideally 
> you’d randomly assign a node to each record to increase parallelism, but then 
> cluster all records on a node by, say, geohash, since it is more likely 
> (however slightly) that when one user from a geographical region is accessed, 
> other users from the same region will also need to be accessed. That’s 
> certainly better than having some random user record next to the one you are 
> accessing at the moment.
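> 
> For example, reading nearby users out of one of those partitions might look 
> something like this (again untested; covering a whole region would mean 
> issuing this against all 256 buckets):
> 
>     SELECT user_id, name
>       FROM users
>      WHERE bucket = 42               -- one of the 256 LSB partitions
>        AND geohash >= '9q8yy'
>        AND geohash <  '9q8yz';       -- users within the same geohash cell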
> 
> 
> 
> 
> On Dec 3, 2013, at 11:32 PM, Vivek Mishra <mishra.v...@gmail.com> wrote:
> 
>> So basically you want to create a cluster of multiple unique keys, but data 
>> which belongs to one unique key should be colocated. Correct?
>> 
>> -Vivek
>> 
>> 
>> On Tue, Dec 3, 2013 at 10:39 AM, onlinespending <onlinespend...@gmail.com> 
>> wrote:
>> Subject says it all. I want to be able to randomly distribute a large set of 
>> records but keep them clustered in one wide row per node.
>> 
>> As an example, let’s say I’ve got a collection of about 1 million records, 
>> each with a unique id. If I just go ahead and set the primary key (and 
>> therefore the partition key) to the unique id, I’ll get very good random 
>> distribution across my server cluster. However, each record will be its own 
>> row. I’d like each record to belong to one large wide row (per server node) 
>> so I can have them sorted or clustered on some other column.
>> 
>> If I have, say, 5 nodes in my cluster, I could randomly assign a value of 1 
>> to 5 at creation time and set the partition key to this value. But this 
>> becomes troublesome if I add or remove nodes. What I effectively want is to 
>> partition on the unique id of the record modulo N (id % N, where N is the 
>> number of nodes).
>> 
>> I have to imagine there’s a mechanism in Cassandra to simply randomize the 
>> partitioning without even using a key (and then clustering on some column).
>> 
>> Thanks for any help.
>> 
> 
