Regardless of the size of 'k', each key gets turned into a ReadCommand internally and is (eventually) delegated to StorageProxy#fetchRows: https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/service/StorageProxy.java#L852
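
For example, a client-side way to spread that work out - an untested sketch with made-up keyspace/table/column names, assuming the DataStax Java driver 2.x, where a whole list can be bound to an "IN ?" marker (against 1.2 / the v1 protocol you'd have to build the statement text for each batch instead):

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class BatchedInReader {

    /**
     * Looks up the given keys in fixed-size IN batches, one batch at a time,
     * so a huge key list doesn't hit the cluster's read stage all at once.
     */
    public static List<Row> fetchInBatches(Session session, List<UUID> keys, int batchSize)
            throws InterruptedException {
        // Prepared once and reused for every batch, so the statement is parsed only once.
        PreparedStatement stmt = session.prepare(
                "SELECT id, payload FROM my_keyspace.my_table WHERE id IN ?");

        List<Row> found = new ArrayList<Row>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            List<UUID> batch = keys.subList(i, Math.min(i + batchSize, keys.size()));
            BoundStatement bound = stmt.bind(batch);   // bind the whole sublist to the IN marker
            ResultSet rs = session.execute(bound);     // blocks until this batch completes
            for (Row row : rs) {
                found.add(row);
            }
            Thread.sleep(50);   // optional pause between batches to throttle further
        }
        return found;
    }
}

Start with a batch size of a few hundred, watch the read stage in nodetool tpstats while it runs, and adjust the size and the pause from there.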
You still send the same number of ReadCommands either way; it's more a matter of throttling resource usage. Honestly, I would say play around with it for a little bit and monitor the results - there is most likely a good sweet spot in there somewhere - bonus if you can make it a stored procedure to mitigate the parsing overhead.

On Thu, Nov 7, 2013 at 9:23 AM, Dan Gould <d...@chill.com> wrote:

> Thanks--that's what I was wondering. So, if I understand you correctly,
> it sounds like a single
>
> SELECT ... WHERE foo in (k items);
>
> can tie up k threads rather than 1 thread per node, which can starve
> other tasks on a cluster. AFAICT, there's no way to say "this query
> should be limited to only __% of the resources on each node". Alas, for
> every other table in our system I've figured out nice ways to denormalize
> and turn complex things into single queries. But this one I can't.
>
> So--alas--it sounds like my best answer will be to issue lots of smaller
> queries with pauses in between. (Or to look at patching C* to be smarter
> about resource management, but I'm not at all familiar with C* internals,
> so this may be impractical at the moment.)
>
>
> On 11/6/13 8:26 PM, Aaron Morton wrote:
>
>> If one big query doesn't cause problems
>
> Every row you read becomes (roughly) RF tasks in the cluster. If you ask
> for 100 rows in one query, it will generate 300 tasks (with RF 3) that
> are processed by the read thread pool, which has a default of 32 threads.
> If you ask for a lot of rows and the number of nodes is low, there is a
> chance the client starves others as they wait for all the tasks to be
> completed. So I tend to like asking for fewer rows.
>
> Cheers
>
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 7/11/2013, at 12:19 pm, Dan Gould <d...@chill.com> wrote:
>
> Thanks Nate,
>
> I assume 10k is the return limit. I don't think I'll ever get close to
> 10k matches to the IN query. That said, you're right: to be safe I'll
> increase the limit to match the number of items in the IN.
>
> I didn't know CQL supported stored procedures, but I'll take a look. I
> suppose my question was asking about parsing overhead, however. If one
> big query doesn't cause problems--which I assume it wouldn't, since there
> can be multiple threads parsing and I assume C* is smart about memory
> when accumulating results--I'd much rather do that.
>
> Dan
>
> On 11/6/13 3:05 PM, Nate McCall wrote:
>
> Unless you explicitly set a page size (I'm pretty sure the query is
> converted to a paging query automatically under the hood), you will get
> capped at the default of 10k, which might get a little weird
> semantically. That said, you should experiment with explicit page sizes
> and see where it gets you (I've not tried this yet with an IN clause -
> would be real curious to hear how it worked).
>
> Another thing to consider is that it's a pretty big statement to parse
> every time. You might want to go the (much) smaller batch route so these
> can be stored procedures? (Another thing I haven't tried with an IN
> clause - don't see why it would not work, though.)
>
>
>
>
> On Wed, Nov 6, 2013 at 4:08 PM, Dan Gould <d...@chill.com> wrote:
>
>> I was wondering if anyone had a sense of performance/best practices
>> around the 'IN' predicate.
>>
>> I have a list of up to potentially ~30k keys that I want to look up in a
>> table (typically queries will have <500, but I worry about the long
>> tail). Most of them will not exist in the table, but, say, about 10-20%
>> will.
>>
>> Would it be best to do:
>>
>> 1) SELECT fields FROM table WHERE id in (uuid1, uuid2, ...... uuid30000);
>>
>> 2) Split into smaller batches--
>>    for group_of_100 in all_30000:
>>        // ** Issue in parallel or block after each one??
>>        SELECT fields FROM table WHERE id in (group_of_100 uuids);
>>
>> 3) Something else?
>>
>> My guess is that (1) is fine and that the only worry is too much data
>> returned (which won't be a problem in this case), but I wanted to check
>> that it's not a C* anti-pattern before.
>>
>> [Conversely, is a batch insert with up to 30k items ok?]
>>
>> Thanks,
>> Dan
>>
>
> --
> -----------------
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com


--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
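
A minimal sketch of the explicit page-size experiment suggested above (assumes Cassandra 2.0's automatic paging and the DataStax Java driver 2.x; keyspace, table, and column names are made up):

import java.util.List;
import java.util.UUID;

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class PagedInQuery {

    // One big IN query, but with an explicit page size so the driver pulls the
    // result back in modest pages instead of one huge response.
    public static void readWithExplicitPageSize(Session session, List<UUID> keys) {
        StringBuilder in = new StringBuilder();
        for (UUID key : keys) {
            if (in.length() > 0) {
                in.append(", ");
            }
            in.append(key);                      // uuid literals need no quoting in CQL
        }

        SimpleStatement stmt = new SimpleStatement(
                "SELECT id, payload FROM my_keyspace.my_table WHERE id IN (" + in + ")");
        stmt.setFetchSize(500);                  // the explicit page size to experiment with

        ResultSet rs = session.execute(stmt);
        for (Row row : rs) {                     // iterating fetches further pages as needed
            System.out.println(row.getUUID("id"));
        }
    }
}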