Regardless of the size of 'k', each key gets turned into a ReadCommand internally and is (eventually) delegated to StorageProxy#fetchRows: https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/service/StorageProxy.java#L852
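
For example, a client-side way to spread that work out - an untested sketch with made-up keyspace/table/column names, assuming the DataStax Java driver 2.x, where a whole list can be bound to an "IN ?" marker (against 1.2 / the v1 protocol you'd have to build the statement text for each batch instead):

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class BatchedInReader {

    /**
     * Looks up the given keys in fixed-size IN batches, one batch at a time,
     * so a huge key list doesn't hit the cluster's read stage all at once.
     */
    public static List<Row> fetchInBatches(Session session, List<UUID> keys, int batchSize)
            throws InterruptedException {
        // Prepared once and reused for every batch, so the statement is parsed only once.
        PreparedStatement stmt = session.prepare(
                "SELECT id, payload FROM my_keyspace.my_table WHERE id IN ?");

        List<Row> found = new ArrayList<Row>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            List<UUID> batch = keys.subList(i, Math.min(i + batchSize, keys.size()));
            BoundStatement bound = stmt.bind(batch);   // bind the whole sublist to the IN marker
            ResultSet rs = session.execute(bound);     // blocks until this batch completes
            for (Row row : rs) {
                found.add(row);
            }
            Thread.sleep(50);   // optional pause between batches to throttle further
        }
        return found;
    }
}

Start with a batch size of a few hundred, watch the read stage in nodetool tpstats while it runs, and adjust the size and the pause from there.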
You still send the same number of ReadCommands either way; it's more a matter of throttling resource usage. Honestly, I would say play around with it for a little bit and monitor the results - there is most likely a good sweet spot in there somewhere - bonus if you can make it a stored procedure to mitigate the parsing overhead.

On Thu, Nov 7, 2013 at 9:23 AM, Dan Gould <d...@chill.com> wrote:

> Thanks--that's what I was wondering. So, if I understand you correctly,
> it sounds like a single
>
> SELECT ... WHERE foo in (k items);
>
> can tie up k threads rather than 1 thread per node, which can starve
> other tasks on a cluster. AFAICT, there's no way to say "this query
> should be limited to only __% of the resources on each node". Alas, for
> every other table in our system I've figured out nice ways to denormalize
> and turn complex things into single queries. But this one I can't.
>
> So--alas--it sounds like my best answer will be to issue lots of smaller
> queries with pauses in between. (Or to look at patching C* to be smarter
> about resource management, but I'm not at all familiar with C* internals,
> so this may be impractical at the moment.)
>
>
> On 11/6/13 8:26 PM, Aaron Morton wrote:
>
>> If one big query doesn't cause problems
>
> Every row you read becomes (roughly) RF tasks in the cluster. If you ask
> for 100 rows in one query, it will generate 300 tasks (with RF 3) that
> are processed by the read thread pool, which has a default of 32 threads.
> If you ask for a lot of rows and the number of nodes is low, there is a
> chance the client starves others as they wait for all the tasks to be
> completed. So I tend to like asking for fewer rows.
>
> Cheers
>
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 7/11/2013, at 12:19 pm, Dan Gould <d...@chill.com> wrote:
>
> Thanks Nate,
>
> I assume 10k is the return limit. I don't think I'll ever get close to
> 10k matches to the IN query. That said, you're right: to be safe I'll
> increase the limit to match the number of items in the IN.
>
> I didn't know CQL supported stored procedures, but I'll take a look. I
> suppose my question was asking about parsing overhead, however. If one
> big query doesn't cause problems--which I assume it wouldn't, since there
> can be multiple threads parsing and I assume C* is smart about memory
> when accumulating results--I'd much rather do that.
>
> Dan
>
> On 11/6/13 3:05 PM, Nate McCall wrote:
>
> Unless you explicitly set a page size (I'm pretty sure the query is
> converted to a paging query automatically under the hood), you will get
> capped at the default of 10k, which might get a little weird
> semantically. That said, you should experiment with explicit page sizes
> and see where it gets you (I've not tried this yet with an IN clause -
> would be real curious to hear how it worked).
>
> Another thing to consider is that it's a pretty big statement to parse
> every time. You might want to go the (much) smaller batch route so these
> can be stored procedures? (Another thing I haven't tried with an IN
> clause - don't see why it would not work, though.)
>
>
>
>
> On Wed, Nov 6, 2013 at 4:08 PM, Dan Gould <d...@chill.com> wrote:
>
>> I was wondering if anyone had a sense of performance/best practices
>> around the 'IN' predicate.
>>
>> I have a list of up to potentially ~30k keys that I want to look up in a
>> table (typically queries will have <500, but I worry about the long
>> tail). Most of them will not exist in the table, but, say, about 10-20%
>> will.
>>
>> Would it be best to do:
>>
>> 1) SELECT fields FROM table WHERE id in (uuid1, uuid2, ...... uuid30000);
>>
>> 2) Split into smaller batches--
>>    for group_of_100 in all_30000:
>>        // ** Issue in parallel or block after each one??
>>        SELECT fields FROM table WHERE id in (group_of_100 uuids);
>>
>> 3) Something else?
>>
>> My guess is that (1) is fine and that the only worry is too much data
>> returned (which won't be a problem in this case), but I wanted to check
>> that it's not a C* anti-pattern before.
>>
>> [Conversely, is a batch insert with up to 30k items ok?]
>>
>> Thanks,
>> Dan
>>
>
> --
> -----------------
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com


--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
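
A minimal sketch of the explicit page-size experiment suggested above (assumes Cassandra 2.0's automatic paging and the DataStax Java driver 2.x; keyspace, table, and column names are made up):

import java.util.List;
import java.util.UUID;

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class PagedInQuery {

    // One big IN query, but with an explicit page size so the driver pulls the
    // result back in modest pages instead of one huge response.
    public static void readWithExplicitPageSize(Session session, List<UUID> keys) {
        StringBuilder in = new StringBuilder();
        for (UUID key : keys) {
            if (in.length() > 0) {
                in.append(", ");
            }
            in.append(key);                      // uuid literals need no quoting in CQL
        }

        SimpleStatement stmt = new SimpleStatement(
                "SELECT id, payload FROM my_keyspace.my_table WHERE id IN (" + in + ")");
        stmt.setFetchSize(500);                  // the explicit page size to experiment with

        ResultSet rs = session.execute(stmt);
        for (Row row : rs) {                     // iterating fetches further pages as needed
            System.out.println(row.getUUID("id"));
        }
    }
}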