Re: Correct model

aaron morton Sun, 23 Sep 2012 14:35:15 -0700

Yup.

(Multi get is just a convenience method; it explodes into multiple gets on the 
server side.)
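
As a rough illustration of that point, here is a minimal Python sketch of a multi-get that simply fans out into individual gets. The `ToyStore` class and its `get`/`multiget` methods are hypothetical names invented for this example; they are not the Cassandra or Hector API.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a row store (not Cassandra's API)."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.get_calls = 0  # counts single-row reads done "server side"

    def get(self, key):
        self.get_calls += 1
        return self.rows.get(key)

    def multiget(self, keys):
        # Convenience wrapper: one client call, but still one get per key.
        return {key: self.get(key) for key in keys}


store = ToyStore({"u1": "a", "u2": "b", "u3": "c"})
result = store.multiget(["u1", "u3", "missing"])
```

The caller makes one call, but the store still does one read per key, so the saving is in round trips rather than in server-side work.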


Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/09/2012, at 5:01 AM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:

> But the only advantage in this solution is to split data among partitions?
> 
> You need to split data among partitions or your query won't scale as more and 
> more data is added to the table.  Having the partition means you are querying 
> far fewer rows.
> 
> What do you mean here by current partition?
> 
> He means determine the ONE partition key and query that partition, i.e., if 
> you want just the latest user requests, figure out the partition key based on 
> which month you are in and query it.  If you want the latest independent of 
> user, query the correct single partition for the GlobalRequests CF.
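
A minimal Python sketch of working out the partition key client side, assuming monthly partitions and a `user_id:YYYYMM` row key layout. The function names and the exact key format are assumptions made for illustration; the thread only specifies `<user_id : partition_start>`.

```python
from datetime import datetime, timezone


def month_partition_start(ts):
    """Return the monthly partition label as YYYYMM text (assumed format)."""
    return ts.strftime("%Y%m")


def user_requests_row_key(user_id, ts):
    # Assumed layout: "<user_id>:<partition_start>" with monthly partitions.
    return f"{user_id}:{month_partition_start(ts)}"


def global_requests_row_key(ts):
    # The global CF is keyed by the time partition alone.
    return month_partition_start(ts)


now = datetime(2012, 9, 23, 14, 35, tzinfo=timezone.utc)
user_key = user_requests_row_key("user42", now)
global_key = global_requests_row_key(now)
```

Because the key is derived from the current time, the client can go straight to the one row it needs without scanning other partitions.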
> 
> If I want all the requests for the user, couldn't I just select all 
> UserRequest records which start with "userId"?
> 
> He designed it so that the user requests table is completely scalable, which 
> is why he has partitions there.  If you don't have partitions, you could run 
> into a row that is too long.  You don't need to design it this way if you 
> know none of your users are going to go into the millions as far as number of 
> requests.  In his design, then, you need to pick the correct partition and 
> query into that partition.
> 
> I really didn't understand why to use partitions.
> 
> If you know your rows will go into the trillions, partitions are a way of 
> breaking them up so that each partition has 100k rows or so, or even 1 
> million, but most likely maxes out in the millions.  Without partitions, you 
> hit a limit in the millions.  With partitions, you can keep scaling past 
> that, as you can have as many partitions as you want.
> 
> A multi-get is a query that finds IN PARALLEL all the rows with the matching 
> keys you send to Cassandra.  If you do 1000 gets (instead of a multi-get) with 
> 1ms latency, you will find it takes 1 second + processing time.  If you do ONE 
> multi-get, you only have 1 request and therefore 1ms latency.  The other 
> solution is to send 1000 "async" gets, but I have a feeling that would be 
> slower with all the marshalling/unmarshalling of the envelope.  It really 
> depends on the envelope size: if we were using HTTP, you would get killed 
> doing 1000 requests instead of 1 with 1000 keys in it.
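
The round-trip arithmetic above can be written out as a small Python sketch. It is a deliberately simplified model with hypothetical function names: it counts only network round trips and ignores server-side processing, which still happens once per key.

```python
def sequential_gets_ms(n_keys, round_trip_ms):
    # One network round trip per key, issued one after another.
    return n_keys * round_trip_ms


def single_multiget_ms(n_keys, round_trip_ms):
    # One round trip regardless of key count (server-side work ignored).
    return round_trip_ms


seq = sequential_gets_ms(1000, 1.0)
multi = single_multiget_ms(1000, 1.0)
```

With 1000 keys and 1 ms per round trip, the sequential case costs about one second of latency, while the single multi-get pays for only one round trip.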
> 
> Later,
> Dean
> 
> From: Marcelo Elias Del Valle <mvall...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Sunday, September 23, 2012 10:23 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Correct model
> 
> 
> 2012/9/20 aaron morton <aa...@thelastpickle.com>
> I would consider:
> 
> # User CF
> * row_key: user_id
> * columns: user properties, key=value
> 
> # UserRequests CF
> * row_key: <user_id : partition_start> where partition_start is the start of 
> a time partition that makes sense in your domain, e.g. partition monthly. 
> You generally want to avoid rows that grow forever; as a rule of thumb, avoid 
> rows of more than a few tens of MB.
> * columns: two possible approaches:
> 1) If the requests are immutable and you generally want all of the data, 
> store the request in a single column using JSON or similar, with the column 
> name a timestamp.
> 2) Otherwise use a composite column name of <timestamp : request_property> to 
> store the request in many columns.
> * In either case consider using Reversed comparators so the most recent 
> columns are first; see 
> http://thelastpickle.com/2011/10/03/Reverse-Comparators/
> 
> # GlobalRequests CF
> * row_key: partition_start - time partition as above. It may be easier to use 
> the same partition scheme.
> * column name: <timestamp : user_id>
> * column value: empty
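
To make the layout above concrete, here is a minimal Python sketch that uses plain dicts as stand-ins for the two request column families. The helper names, the `user_id:partition` key format, and the integer timestamps are assumptions made for illustration; the reverse sort only emulates what a reversed comparator would do on the server.

```python
import json

# Row key -> {column name -> column value}; plain dicts stand in for CFs.
user_requests = {}
global_requests = {}


def record_request(user_id, partition, timestamp, request):
    # Approach 1 from the list: whole request as JSON in one column,
    # with the timestamp as the column name.
    row = user_requests.setdefault(f"{user_id}:{partition}", {})
    row[timestamp] = json.dumps(request)
    # Composite column name <timestamp : user_id>, empty value.
    global_requests.setdefault(partition, {})[(timestamp, user_id)] = ""


def newest_first(row):
    # Emulates a reversed comparator: largest column name first.
    return sorted(row.items(), reverse=True)


record_request("alice", "201209", 100, {"path": "/a"})
record_request("bob", "201209", 105, {"path": "/b"})
record_request("alice", "201209", 110, {"path": "/c"})

latest_global = [name for name, _ in newest_first(global_requests["201209"])]
latest_alice = [ts for ts, _ in newest_first(user_requests["alice:201209"])]
```

Each write touches one row per column family, and reading the newest entries is a read from the front of a single row.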
> 
> Ok, I think I understood your suggestion... But the only advantage in this 
> solution is to split data among partitions? I understood how it would work, 
> but I didn't understand why it's better than the other solution, without the 
> GlobalRequests CF
> 
> - Select all the requests for a user
> Work out the current partition client side, get the first N columns. Then 
> page.
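
A minimal Python sketch of "get the first N columns, then page" over a single row, assuming newest-first column order and an inclusive start column. `get_slice` and `page_columns` are hypothetical helpers written for this example, not Hector or Thrift calls.

```python
def get_slice(row, start=None, count=3):
    """Return up to `count` (name, value) pairs, newest first.

    `start` is inclusive, mirroring the inclusive slice start assumed here.
    """
    names = sorted(row, reverse=True)
    if start is not None:
        names = [n for n in names if n <= start]
    return [(n, row[n]) for n in names[:count]]


def page_columns(row, page_size):
    start = None
    while True:
        # After the first page, ask for one extra column because the
        # inclusive start repeats the last column already seen.
        count = page_size if start is None else page_size + 1
        page = get_slice(row, start, count)
        if start is not None:
            page = page[1:]
        if not page:
            return
        yield page
        start = page[-1][0]


row = {ts: f"req{ts}" for ts in range(1, 6)}
pages = [[name for name, _ in page] for page in page_columns(row, 3)]
```

The last column name of each page becomes the start of the next request, so the client never has to load the whole row at once.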
> 
> What do you mean here by current partition? You mean I would perform a query 
> for each partition? If I want all the requests for the user, couldn't I 
> just select all UserRequest records which start with "userId"? I might be 
> missing something here, but in my understanding if I use Hector to query a 
> column family I can do that and Cassandra servers will automatically 
> communicate with each other to get the data I need, right? Is it bad? I 
> really didn't understand why to use partitions.
> 
> 
> - Select all the users which have new requests, since date D
> Work out the current partition client side, get the first N columns from 
> GlobalRequests, make a multi get call to UserRequests
> NOTE: Assuming the size of the global requests space is not huge.
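
The two-step lookup described above can be sketched in Python as follows, again with dicts standing in for the column families. The function names, integer timestamps, and `user_id:partition` key format are assumptions for illustration only.

```python
def users_with_requests_since(global_row, since_ts, limit):
    """global_row maps (timestamp, user_id) column names to empty values."""
    newest = sorted(global_row, reverse=True)[:limit]  # first N columns
    users = []
    for ts, user_id in newest:
        if ts < since_ts:
            break  # columns are newest first, so the rest are older
        if user_id not in users:
            users.append(user_id)
    return users


def multiget(user_requests, row_keys):
    # Stand-in for a multi-get: one lookup per row key.
    return {key: user_requests.get(key, {}) for key in row_keys}


partition = "201209"
global_row = {
    (100, "alice"): "",
    (105, "bob"): "",
    (110, "alice"): "",
    (90, "carol"): "",
}
user_requests = {
    "alice:201209": {110: "c", 100: "a"},
    "bob:201209": {105: "b"},
}

users = users_with_requests_since(global_row, since_ts=100, limit=10)
rows = multiget(user_requests, [f"{u}:{partition}" for u in users])
```

The GlobalRequests row acts as an index of who was active, and the multi-get then fetches only those users' rows for the same partition.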
> Hope that helps.
> For sure it is helping a lot. However, I don't know what a multiget is... I 
> saw the Hector API reference and found this method, but I am not sure what 
> Cassandra would do internally if I do a multiget... Is this expensive in 
> terms of performance and latency?
> 
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
