On Tue, Jan 18, 2011 at 4:29 PM, Shu Zhang <szh...@mediosystems.com> wrote:
> Well, I don't think what I'm describing is complicated semantics. I think
> I've described general batch operation design, something symmetrical to the
> batch_mutate method already on the Cassandra API. You are right, I can
> solve the problem with further denormalization, and the approach of making
> individual gets in parallel as described by Brandon will work too. I'll be
> doing one of these for now. But I think neither is as efficient, and I
> guess I'm still not sure why the multiget is designed the way it is.
>
> The problem with denormalization is you have to make multiple row writes in
> place of one, adding load to the server, taking more physical space and
> losing atomicity on write operations. I know writes are cheap in Cassandra,
> and you can catch failed writes and retry, so these problems are not major,
> but it still seems clear that having a batch-get that works appropriately
> is at least a little better...
> ________________________________________
> From: Aaron Morton [aa...@thelastpickle.com]
> Sent: Tuesday, January 18, 2011 12:55 PM
> To: user@cassandra.apache.org
> Subject: Re: please help with multiget
>
> I think the general approach is to denormalise data to remove the need for
> complicated semantics when reading.
>
> Aaron
>
> On 19/01/2011, at 7:57 AM, Shu Zhang <szh...@mediosystems.com> wrote:
>
>> Well, maybe making a batch-get is not any more efficient on the server
>> side, but without it you can get bottlenecked on client-server connections
>> and client resources. If the number of requests you want to batch is on
>> the order of connections in your pool, then yes, making gets in parallel
>> is as good or maybe better. But what if you want to batch thousands of
>> requests?
>>
>> The server I can scale out; I would want to get my requests there without
>> needing to wait for connections on my client to free up.
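For what it's worth, the denormalization trade-off Shu describes can be sketched in a few lines. This is only a toy illustration: an in-memory dict stands in for the column families, and every name here (events_by_user, record_event, etc.) is made up for the example.

```python
# Toy illustration of the denormalization trade-off: one logical write
# becomes several physical row writes, so each read path can be served
# by a single-row get. The dict stands in for two column families.
store = {"events_by_user": {}, "events_by_day": {}}

def record_event(user, day, event):
    # Two row writes in place of one -- extra server load and space, and
    # no atomicity across the two rows -- but each read "view" now needs
    # only one key lookup instead of a cross-key batch get.
    store["events_by_user"].setdefault(user, []).append(event)
    store["events_by_day"].setdefault(day, []).append(event)

record_event("alice", "2011-01-18", "login")
record_event("alice", "2011-01-18", "purchase")
```

If either write fails you have to catch and retry it yourself, which is exactly the loss of atomicity being discussed.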
>>
>> I just don't really understand the reasoning for designing multiget_slice
>> the way it is. I still think if you're going to have a batch-get request
>> (multiget_slice), you should be able to add to the batch a reasonable
>> number of ANY corresponding non-batch get requests. And you can't do
>> that... Plus, it's not symmetrical to batch_mutate. Is there a good
>> reason for that?
>> ________________________________________
>> From: Brandon Williams [dri...@gmail.com]
>> Sent: Monday, January 17, 2011 5:09 PM
>> To: user@cassandra.apache.org
>> Cc: hector-us...@googlegroups.com
>> Subject: Re: please help with multiget
>>
>> On Mon, Jan 17, 2011 at 6:53 PM, Shu Zhang <szh...@mediosystems.com>
>> wrote:
>> Here's the method declaration for quick reference:
>>
>> map<string,list<ColumnOrSuperColumn>> multiget_slice(string keyspace,
>> list<string> keys, ColumnParent column_parent, SlicePredicate predicate,
>> ConsistencyLevel consistency_level)
>>
>> It looks like you must have the same SlicePredicate for every key in your
>> batch retrieval, so what are you supposed to do when you need to retrieve
>> different columns for different keys?
>>
>> Issue multiple gets in parallel yourself. Keep in mind that multiget is
>> not an optimization; in fact, it can work against you when one key exceeds
>> the rpc timeout, because you get nothing back.
>>
>> -Brandon
>
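Brandon's suggestion (issue one get per key in parallel, so each can carry its own predicate) looks roughly like this in Python. fetch_one is a stand-in for a real get_slice call against Cassandra, and the keys and column lists are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-key requests: each key wants different columns back,
# which a single multiget_slice SlicePredicate cannot express.
requests = {
    "user:1": ["name", "email"],
    "user:2": ["last_login"],
    "user:3": ["name", "last_login", "email"],
}

def fetch_one(key, columns):
    # In a real client this would be a Thrift get_slice call with a
    # SlicePredicate built from `columns`; here we fake the result.
    return key, {c: "value-of-" + c for c in columns}

# Issue the gets in parallel, one per key, so each request carries its
# own column list -- the workaround for multiget_slice's one predicate.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(lambda kv: fetch_one(*kv), requests.items()))
```

A nice side effect is that a single slow key only delays its own future, rather than timing out the whole batch as Brandon warns multiget can.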
multiget_slice is very useful IMHO. In my testing, the round-trip time for 1000 get requests each acked individually is much higher than the round-trip time for 200 multiget_slice requests of 5 keys each. Anyone who needs that type of access is in good shape. I was also theorizing that a CF using RowCache with a very, very high read rate would benefit from "pooling" a bunch of reads together with multiget. But I do agree that the first time I looked at the multiget_slice signature, I realized I could not do many of the things I was expecting from a multi-get.
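The batching arithmetic above (1000 individual gets versus 200 batches of 5) comes down to chunking the key list before calling multiget_slice. A minimal sketch, with the actual Cassandra call omitted:

```python
# Chunk a key list into fixed-size batches; each batch would become one
# multiget_slice call, so round trips drop by the batch size.
def chunk(keys, size):
    # Yield successive `size`-length slices of the key list.
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

keys = ["key%d" % i for i in range(1000)]
batches = list(chunk(keys, 5))
print(len(batches))  # -> 200 round trips instead of 1000
```

The constraint from earlier in the thread still applies: every key in a batch shares one SlicePredicate, so this only helps when all the keys want the same columns.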