2012/9/23 Hiller, Dean <dean.hil...@nrel.gov>

> You need to split data among partitions or your query won't scale as more
> and more data is added to table. Having the partition means you are
> querying a lot less rows.

This will happen in case I can query just one partition. But if I need to query things in multiple partitions, wouldn't it be slower?
> He means determine the ONE partition key and query that partition. I.e., if
> you want just the latest user requests, figure out the partition key based
> on which month you are in and query it. If you want the latest independent
> of user, query the correct single partition for the GlobalRequests CF.

But in this case, I didn't understand Aaron's model then. My first query is to get all requests for a user. If I partition by time, I will need to query all partitions to get the results, right? In his answer it was said I would query ONE partition...

> If I want all the requests for the user, couldn't I just select all
> UserRequest records which start with "userId"?
>
> He designed it so the user requests table was completely scalable, so he
> has partitions there. If you don't have partitions, you could run into a
> row that is too long. You don't need to design it this way if you know
> none of your users are going to go into the millions as far as number of
> requests. In his design then, you need to pick the correct partition and
> query into that partition.

You mean too many rows, not a row too long, right? I am assuming each request will be a different row, not a new column. Is having billions of ROWS bad for performance in Cassandra? I know Cassandra allows up to 2 billion columns for a CF, but I am not aware of a limitation for rows...

> I really didn't understand why to use partitions.
>
> Partitions are a way, if you know your rows will go into the trillions, of
> breaking them up so each partition has 100k rows or so, or even 1 million,
> but most likely maxes out in the millions. Without partitions, you hit a
> limit in the millions. With partitions, you can keep scaling past that, as
> you can have as many partitions as you want.

If I understood it correctly, if I don't specify partitions, Cassandra will store all my data in a single node? I thought Cassandra would automatically distribute my data among nodes as I insert rows into a CF.
Of course if I use partitions I understand I could query just one partition (node) to get the data, if I have the partition field, but to the best of my knowledge, this is not what happens in my case, right? In the first query I would have to query all the partitions... Or are you saying partitions have nothing to do with nodes? If 99.999% of my users will have fewer than 100k requests, would it make sense to partition by user?

> A multi-get is a query that finds IN PARALLEL all the rows with the
> matching keys you send to Cassandra. If you do 1000 gets (instead of a
> multi-get) with 1ms latency, you will find it takes 1 second + processing
> time. If you do ONE multi-get, you only have 1 request and therefore 1ms
> latency. The other solution is you could send 1000 "async" gets, but I
> have a feeling that would be slower with all the marshalling/unmarshalling
> of the envelope... it really depends on the envelope size, like if we were
> using http, you would get killed doing 1000 requests instead of 1 with
> 1000 keys in it.

That's cool! :D So if I need to query data split in 10 partitions, for instance, I can perform the query in parallel by using a multiget, right? Out of curiosity, if each get will occur on a different node, would I need to connect to each of the nodes? Or would I query 1 node and it would communicate with the others?
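
To make the arithmetic in Dean's multi-get explanation concrete, here is a minimal sketch. It does not talk to Cassandra: `FakeStore`, its methods, and the fixed 1 ms per-request latency are all assumptions made up for illustration, and server-side processing time is ignored. It only counts client round trips.

```python
class FakeStore:
    """Hypothetical in-memory stand-in for a column family.

    It only counts client round trips; it is not a Cassandra client.
    """

    def __init__(self, rows, latency_ms=1.0):
        self.rows = rows              # row_key -> value
        self.latency_ms = latency_ms  # assumed fixed cost per request
        self.round_trips = 0

    def get(self, key):
        # One request per key.
        self.round_trips += 1
        return self.rows.get(key)

    def multiget(self, keys):
        # One request carrying all the keys.
        self.round_trips += 1
        return {k: self.rows[k] for k in keys if k in self.rows}

    def network_ms(self):
        # Total latency paid on the wire, ignoring processing time.
        return self.round_trips * self.latency_ms
```

With 1000 keys, calling `get` once per key pays the assumed latency 1000 times, while a single `multiget` pays it once. That is the "1 second + processing time" versus "1ms latency" comparison in the quoted text.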
>
> Later,
> Dean
>
> From: Marcelo Elias Del Valle <mvall...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Sunday, September 23, 2012 10:23 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Correct model
>
> 2012/9/20 aaron morton <aa...@thelastpickle.com>
> I would consider:
>
> # User CF
> * row_key: user_id
> * columns: user properties, key=value
>
> # UserRequests CF
> * row_key: <user_id : partition_start> where partition_start is the start
>   of a time partition that makes sense in your domain, e.g. partition
>   monthly. Generally want to avoid rows that grow forever; as a rule of
>   thumb avoid rows of more than a few tens of MB.
> * columns: two possible approaches:
>   1) If the requests are immutable and you generally want all of the data,
>      store the request in a single column using JSON or similar, with the
>      column name a timestamp.
>   2) Otherwise use a composite column name of
>      <timestamp : request_property> to store the request in many columns.
> * In either case consider using Reversed comparators so the most recent
>   columns are first, see
>   http://thelastpickle.com/2011/10/03/Reverse-Comparators/
>
> # GlobalRequests CF
> * row_key: partition_start - time partition as above. It may be easier to
>   use the same partition scheme.
> * column name: <timestamp : user_id>
> * column value: empty
>
> Ok, I think I understood your suggestion... But is the only advantage of
> this solution to split data among partitions? I understood how it would
> work, but I didn't understand why it's better than the other solution,
> without the GlobalRequests CF.
>
> - Select all the requests for a user
> Work out the current partition client side, get the first N columns. Then
> page.
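
The "work out the current partition client side" step quoted above can be sketched as follows, assuming monthly partitions. The function names, the `:` separator and the ISO date format are assumptions made up for illustration; a real client would more likely build the `<user_id : partition_start>` key with its own composite serializer.

```python
from datetime import date


def partition_start(day: date) -> str:
    """First day of the month containing `day`, as an ISO date string."""
    return day.replace(day=1).isoformat()


def user_requests_row_key(user_id: str, day: date) -> str:
    """Row key for UserRequests: <user_id : partition_start>.

    The ':' separator is an assumption for this sketch.
    """
    return f"{user_id}:{partition_start(day)}"


def global_requests_row_key(day: date) -> str:
    """Row key for GlobalRequests: just the partition start."""
    return partition_start(day)


def row_keys_between(user_id: str, first: date, last: date) -> list:
    """All monthly UserRequests row keys covering first..last, oldest first."""
    keys = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        keys.append(user_requests_row_key(user_id, date(year, month, 1)))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys
```

Under these assumptions, the latest requests for one user need only the key for the current month, while "all requests for a user" means one key per month the user has been active, which is the list `row_keys_between` returns.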
>
> What do you mean here by current partition? You mean I would perform a
> query for each partition? If I want all the requests for the user,
> couldn't I just select all UserRequest records which start with "userId"?
> I might be missing something here, but in my understanding if I use Hector
> to query a column family I can do that and Cassandra servers will
> automatically communicate with each other to get the data I need, right?
> Is it bad? I really didn't understand why to use partitions.
>
> - Select all the users that have new requests, since date D
> Work out the current partition client side, get the first N columns from
> GlobalRequests, make a multi get call to UserRequests
> NOTE: Assuming the size of the global requests space is not huge.
> Hope that helps.
>
> For sure it is helping a lot. However, I don't know what a multiget is...
> I saw the Hector API reference and found this method, but I'm not sure
> what Cassandra would do internally if I do a multiget... Is this expensive
> in terms of performance and latency?
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr