Re: CQL 'IN' predicate

Aaron Morton Wed, 06 Nov 2013 20:27:01 -0800

> If one big query doesn't cause problems

Every row you read becomes a (roughly) RF number of tasks in the cluster. If 
you ask for 100 rows in one query it will generate 300 tasks that are processed 
by the read thread pool which as a default of 32 threads. If you ask for a lot 
of rows and the number of nodes in low there is a chance the client starve 
others as they wait for all the tasks to be completed. So i tend to like asking 
for fewer rows.


Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 7/11/2013, at 12:19 pm, Dan Gould <d...@chill.com> wrote:

> Thanks Nate,
> 
> I assume 10k is the return limit.  I don't think I'll ever get close to 10k 
> matches to the IN query.  That said, you're right: to be safe I'll increase 
> the limit to match the number of items on the IN.
> 
> I didn't know CQL supported stored procedures, but I'll take a look.  I 
> suppose my question was asking about parsing overhead, however.  If one big 
> query doesn't cause problems--which I assume it wouldn't since there can be 
> multiple threads parsing and I assume C* is smart about memory when 
> accumulating results--I'd much rather do that.
> 
> Dan
> 
> On 11/6/13 3:05 PM, Nate McCall wrote:
>> Unless you explicitly set a page size (i'm pretty sure the query is 
>> converted to a paging query automatically under the hood) you will get 
>> capped at the default of 10k which might get a little weird semantically. 
>> That said, you should experiment with explicit page sizes and see where it 
>> gets you (i've not tried this yet with an IN clause - would be real curious 
>> to hear how it worked). 
>> 
>> Another thing to consider is that it's a pretty big statement to parse every 
>> time. You might want to go the (much) smaller batch route so these can be 
>> stored procedures? (another thing I havent tried with IN clause - don't see 
>> why it would not work though).
>> 
>> 
>> 
>> 
>> On Wed, Nov 6, 2013 at 4:08 PM, Dan Gould <d...@chill.com> wrote:
>> I was wondering if anyone had a sense of performance/best practices
>> around the 'IN' predicate.
>> 
>> I have a list of up to potentially ~30k keys that I want to look up in a
>> table (typically queries will have <500, but I worry about the long tail).  
>> Most
>> of them will not exist in the table, but, say, about 10-20% will.
>> 
>> Would it be best to do:
>> 
>> 1) SELECT fields FROM table WHERE id in (uuid1, uuid2, ...... uuid30000);
>> 
>> 2) Split into smaller batches--
>> for group_of_100 in all_30000:
>>    // ** Issue in parallel or block after each one??
>>    SELECT fields FROM table WHERE id in (group_of_100 uuids);
>> 
>> 3) Something else?
>> 
>> My guess is that (1) is fine and that the only worry is too much data 
>> returned (which won't be a problem in this case), but I wanted to check that 
>> it's not a C* anti-pattern before.
>> 
>> [Conversely, is a batch insert with up to 30k items ok?]
>> 
>> Thanks,
>> Dan
>> 
>> 
>> 
>> 
>> -- 
>> -----------------
>> Nate McCall
>> Austin, TX
>> @zznate
>> 
>> Co-Founder & Sr. Technical Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>

Re: CQL 'IN' predicate

Reply via email to