Thanks--that's what I was wondering. So, if I understand you correctly,
it sounds like a single
SELECT ... WHERE foo in (k items);
can tie up k threads rather than one thread per node, which can starve
other tasks on a cluster. AFAICT, there's no way to say "this query
should be limited to only __% of the resources on each node". Alas, for
every other table in our system I've figured out nice ways to
denormalize and turn complex things into single queries. But this one
I can't.
So--alas--it sounds like my best answer will be to issue lots of smaller
queries with pauses in-between. (Or to look at patching C* to be
smarter about resource management, but I'm not-at-all familiar with C*
internals so this may be impractical at the moment.)
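For anyone landing on this thread later, here is a minimal sketch of the "smaller queries with pauses" approach, assuming the DataStax Python driver; the session setup, table name, and column names are placeholders, not from the original thread:

```python
import time

def chunked(keys, size=100):
    """Yield successive size-length slices of keys."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

def fetch_in_batches(session, keys, batch_size=100, pause_s=0.05):
    # Hypothetical sketch: 'session' would be a live cassandra-driver
    # Session; 'table', 'fields', and 'id' are placeholder names.
    rows = []
    for group in chunked(keys, batch_size):
        placeholders = ", ".join(["%s"] * len(group))
        query = "SELECT fields FROM table WHERE id IN ({})".format(placeholders)
        rows.extend(session.execute(query, group))
        time.sleep(pause_s)  # back off so other clients get read threads
    return rows

# The chunking itself is easy to verify: 30000 keys in groups of 100 -> 300 queries.
groups = list(chunked(list(range(30000)), 100))
print(len(groups))  # 300
```

The pause length and batch size are tuning knobs; the point is only that each small IN query generates a bounded number of read tasks, so other clients get a turn at the read pool between batches.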
On 11/6/13 8:26 PM, Aaron Morton wrote:
If one big query doesn't cause problems
Every row you read becomes (roughly) RF tasks in the cluster. If you
ask for 100 rows in one query it will generate 300 tasks that are
processed by the read thread pool, which has a default of 32 threads.
If you ask for a lot of rows and the number of nodes is low, there is
a chance one client starves the others while it waits for all its
tasks to be completed. So I tend to like asking for fewer rows.
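To put numbers on that, here is a back-of-the-envelope check using the figures above (RF = 3 and the 32-thread default read pool):

```python
def read_tasks(rows_requested, rf=3):
    # Each requested row fans out to roughly RF read tasks across the cluster.
    return rows_requested * rf

READ_POOL_THREADS = 32  # default size of the read stage thread pool

tasks = read_tasks(100)
print(tasks)                      # 300
print(tasks > READ_POOL_THREADS)  # True: one 100-key query outnumbers the pool
```

So a single 100-key IN query can queue roughly ten times more tasks than there are read threads on a node, which is exactly the starvation risk described above.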
Cheers
-----------------
Aaron Morton
New Zealand
@aaronmorton
Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
On 7/11/2013, at 12:19 pm, Dan Gould <d...@chill.com> wrote:
Thanks Nate,
I assume 10k is the return limit. I don't think I'll ever get close
to 10k matches to the IN query. That said, you're right: to be safe
I'll increase the limit to match the number of items on the IN.
I didn't know CQL supported stored procedures, but I'll take a look.
I suppose my question was mostly about parsing overhead, though. If
one big query doesn't cause problems--which I assume it wouldn't
since there can be multiple threads parsing and I assume C* is smart
about memory when accumulating results--I'd much rather do that.
Dan
On 11/6/13 3:05 PM, Nate McCall wrote:
Unless you explicitly set a page size (I'm pretty sure the query is
converted to a paging query automatically under the hood) you will
get capped at the default of 10k, which might get a little weird
semantically. That said, you should experiment with explicit page
sizes and see where it gets you (I've not tried this yet with an IN
clause - would be real curious to hear how it worked).
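For later readers, a hedged sketch of what an explicit page size looks like with the DataStax Python driver (the `fetch_size` parameter on `SimpleStatement`); the table and column names are placeholders, and the live-session part is left as comments since it needs a running cluster:

```python
def in_query(n):
    """Build a SELECT with an n-term IN clause using the driver's %s markers."""
    return "SELECT fields FROM table WHERE id IN ({})".format(
        ", ".join(["%s"] * n))

print(in_query(3))  # SELECT fields FROM table WHERE id IN (%s, %s, %s)

# With a live session (hypothetical setup, not runnable standalone):
#   from cassandra.query import SimpleStatement
#   stmt = SimpleStatement(in_query(len(keys)), fetch_size=500)
#   for row in session.execute(stmt, keys):  # driver fetches page by page
#       process(row)
```

Whether paging interacts cleanly with a large IN clause is exactly the open question in this thread, so treat the `fetch_size` value as something to experiment with rather than a recommendation.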
Another thing to consider is that it's a pretty big statement to
parse every time. You might want to go the (much) smaller batch
route so these can be stored procedures? (Another thing I haven't
tried with an IN clause - don't see why it wouldn't work though.)
On Wed, Nov 6, 2013 at 4:08 PM, Dan Gould <d...@chill.com> wrote:
I was wondering if anyone had a sense of performance/best practices
around the 'IN' predicate.
I have a list of up to potentially ~30k keys that I want to look up
in a table (typically queries will have <500, but I worry about the
long tail). Most of them will not exist in the table, but, say,
about 10-20% will.
Would it be best to do:
1) SELECT fields FROM table WHERE id in (uuid1, uuid2, ......
uuid30000);
2) Split into smaller batches--
for group_of_100 in all_30000:
// ** Issue in parallel or block after each one??
SELECT fields FROM table WHERE id in (group_of_100 uuids);
3) Something else?
My guess is that (1) is fine and that the only worry is too much
data returned (which won't be a problem in this case), but I
wanted to check that it's not a C* anti-pattern first.
[Conversely, is a batch insert with up to 30k items ok?]
Thanks,
Dan
--
-----------------
Nate McCall
Austin, TX
@zznate
Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com