Yup. (Multi get is just a convenience method; it explodes into multiple gets on the server side.)
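
As a rough illustration of that point, here is a toy in-memory model. It is not Cassandra or Hector code; the ToyStore class and its two counters are made up for this sketch. It only shows that a multi-get saves client round trips while the server side still does one read per key.

```python
class ToyStore:
    """In-memory stand-in for a column family (hypothetical, not a real client)."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.round_trips = 0   # requests sent by the client
        self.server_reads = 0  # individual row reads done by the "server"

    def _read(self, key):
        # One server-side row read.
        self.server_reads += 1
        return self.rows.get(key)

    def get(self, key):
        # One request, one row read.
        self.round_trips += 1
        return self._read(key)

    def multiget(self, keys):
        # One request that fans out into one row read per key.
        self.round_trips += 1
        return {key: self._read(key) for key in keys}
```

Calling multiget with three keys on this toy store counts one round trip and three server reads; three separate get calls would count three of each.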
Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/09/2012, at 5:01 AM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:

> But the only advantage in this solution is to split data among partitions?
>
> You need to split data among partitions or your query won't scale as more
> and more data is added to the table. Having the partition means you are
> querying far fewer rows.
>
> What do you mean here by current partition?
>
> He means determine the ONE partition key and query that partition. I.e. if
> you want just the latest user requests, figure out the partition key based
> on which month you are in and query it. If you want the latest independent
> of user, query the correct single partition for the GlobalRequests CF.
>
> If I want all the requests for the user, couldn't I just select all
> UserRequest records which start with "userId"?
>
> He designed it so the user requests table was completely scalable, so he
> has partitions there. If you don't have partitions, you could run into a
> row that is too long. You don't need to design it this way if you know none
> of your users are going to go into the millions as far as number of
> requests. In his design, then, you need to pick the correct partition and
> query into that partition.
>
> I really didn't understand why to use partitions.
>
> If you know your rows will go into the trillions, partitions are a way of
> breaking them up so each partition has 100k rows or so, or even 1 million,
> but most likely maxes out in the millions. Without partitions, you hit a
> limit in the millions. With partitions, you can keep scaling past that, as
> you can have as many partitions as you want.
>
> A multi-get is a query that finds IN PARALLEL all the rows with the
> matching keys you send to Cassandra. If you do 1000 gets (instead of a
> multi-get) with 1ms latency each, you will find it takes 1 second +
> processing time.
> If you do ONE multi-get, you only have 1 request and therefore 1ms
> latency. The other solution is you could send 1000 "async" gets, but I
> have a feeling that would be slower with all the marshalling/unmarshalling
> of the envelope... it really depends on the envelope size. If we were
> using http, for example, you would get killed doing 1000 requests instead
> of 1 with 1000 keys in it.
>
> Later,
> Dean
>
> From: Marcelo Elias Del Valle <mvall...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Sunday, September 23, 2012 10:23 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Correct model
>
>
> 2012/9/20 aaron morton <aa...@thelastpickle.com>
> I would consider:
>
> # User CF
> * row_key: user_id
> * columns: user properties, key=value
>
> # UserRequests CF
> * row_key: <user_id : partition_start> where partition_start is the start
> of a time partition that makes sense in your domain, e.g. partition
> monthly. Generally you want to avoid rows that grow forever; as a rule of
> thumb, avoid rows larger than a few tens of MB.
> * columns: two possible approaches:
> 1) If the requests are immutable and you generally want all of the data,
> store the request in a single column using JSON or similar, with the
> column name a timestamp.
> 2) Otherwise use a composite column name of <timestamp : request_property>
> to store the request in many columns.
> * In either case consider using Reversed comparators so the most recent
> columns are first; see
> http://thelastpickle.com/2011/10/03/Reverse-Comparators/
>
> # GlobalRequests CF
> * row_key: partition_start - time partition as above. It may be easier to
> use the same partition scheme.
> * column name: <timestamp : user_id>
> * column value: empty
>
> Ok, I think I understood your suggestion... But the only advantage in this
> solution is to split data among partitions? I understood how it would
> work, but I didn't understand why it's better than the other solution,
> without the GlobalRequests CF.
>
> - Select all the requests for a user
> Work out the current partition client side, get the first N columns. Then
> page.
>
> What do you mean here by current partition? You mean I would perform a
> query for each partition? If I want all the requests for the user,
> couldn't I just select all UserRequest records which start with "userId"?
> I might be missing something here, but in my understanding if I use hector
> to query a column family I can do that and Cassandra servers will
> automatically communicate with each other to get the data I need, right?
> Is it bad? I really didn't understand why to use partitions.
>
>
> - Select all the users which have new requests, since date D
> Work out the current partition client side, get the first N columns from
> GlobalRequests, make a multi get call to UserRequests
> NOTE: Assuming the size of the global requests space is not huge.
> Hope that helps.
> For sure it is helping a lot. However, I don't know what a multiget is...
> I saw the hector api reference and found this method, but I'm not sure
> what Cassandra would do internally if I do a multiget... Is this expensive
> in terms of performance and latency?
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
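
A minimal sketch of "work out the current partition client side", assuming monthly partitions labelled YYYY-MM and a UserRequests row key of the form user_id:partition_start. The function names and the key format are illustrative assumptions for this sketch, not part of Hector or Cassandra.

```python
from datetime import datetime


def partition_start(ts):
    """Monthly partition label for a datetime, e.g. '2012-09' (assumed format)."""
    return ts.strftime("%Y-%m")


def user_requests_key(user_id, ts):
    """Row key <user_id : partition_start> for the UserRequests CF."""
    return f"{user_id}:{partition_start(ts)}"


def previous_partition(label):
    """Label of the month before the given 'YYYY-MM' label."""
    year, month = map(int, label.split("-"))
    if month == 1:
        return f"{year - 1}-12"
    return f"{year}-{month - 1:02d}"


def partitions_since(now, since):
    """Partition labels from the current month back to the month of `since`,
    newest first, so a client can page backwards one partition at a time."""
    labels = []
    label = partition_start(now)
    oldest = partition_start(since)
    # 'YYYY-MM' labels with 4-digit years sort correctly as plain strings.
    while label >= oldest:
        labels.append(label)
        label = previous_partition(label)
    return labels
```

For "all the users which have new requests since date D", a client could read the GlobalRequests rows named by partitions_since(now, D), collect the user ids from the column names, and then issue one multi-get with the matching user_requests_key values.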