Re: Proper use of COUNT

DuyHai Doan Tue, 19 Apr 2016 14:58:05 -0700

Jack, you should have a look at my blog post, I did some testing with
various value for paging using aggregate functions:
http://www.doanduyhai.com/blog/?p=2015


On Tue, Apr 19, 2016 at 10:23 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> BTW, I did notice this Jira for setting a client timeout for cqlsh, so
> maybe this is the culprit for that user:
>
> CASSANDRA-7516 - Configurable client timeout for cqlsh
> https://issues.apache.org/jira/browse/CASSANDRA-7516
>
> Or, should they actually be using the --request-timeout command line
> option for cqlsh?
>
> -- Jack Krupansky
>
> On Tue, Apr 19, 2016 at 4:56 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> Sylvain & Tyler, this Jira is for a user reporting a timeout for SELECT
>> COUNT(*) using 3.3:
>> https://issues.apache.org/jira/browse/CASSANDRA-11566
>>
>> I'll let one of you guys follow up on that. I mean, I thought it was
>> timing out die to the amount of data, but you guys are saying that paging
>> should make that not a problem. Or is there a timeout in cqlsh simply
>> because the operation is slow - as opposed to the server reporting an
>> internal timeout?
>>
>> Thanks.
>>
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Apr 19, 2016 at 12:45 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>>
>>>
>>> On Tue, Apr 19, 2016 at 11:32 AM, Jack Krupansky <
>>> jack.krupan...@gmail.com> wrote:
>>>
>>>>
>>>> Are the queries sent from the coordinator to other nodes sequencing
>>>> through partitions in token order and that's what allows the coordinator to
>>>> dedupe with just a single page at a time? IOW, if a target node responds
>>>> with a row from token t, then by definition there will be no further rows
>>>> returned from that node with a token less than t?
>>>>
>>>
>>> That's correct.  The internal paging for aggregation queries is exactly
>>> the same as the normal "client facing" paging.
>>>
>>>
>>>>
>>>> And if I understand all of this so far, this means that for 3.x COUNT
>>>> (and other aggregate functions) are "safe but may be slow" (paraphrasing
>>>> Sylvain.) Is this for 3.0 and later or some other 3.x (or even some 2.x)?
>>>>
>>>
>>> I think count(*) started using paging internally in 2.1, but I'm having
>>> trouble finding the jira ticket.  It could have been 2.0.
>>>
>>> The new aggregation functions in 2.2 utilize the same code path.
>>>
>>>
>>>>
>>>> There remains the question of recommended usage for COUNT. I think my
>>>> two proposed guidelines remain valid (ignoring the old timeout issue), with
>>>> the only remaining question about how large a row count is advisable for
>>>> "decent" request latency. 1,000? 10,000? Granted, it depends on the
>>>> specific data and hardware, but I'm thinking that the guidance should be
>>>> that you should only use COUNT(*) for no more than "low thousands" of rows
>>>> unless you are willing to accept it both being very slow and very
>>>> disruptive to normal cluster health. IOW, it's more like a batch analytics
>>>> operation than a real-time operation. An occasional administrative query to
>>>> measure table size should be okay, but common use for OLTP should be
>>>> restricted to relatively narrow slices or row counts... I think. Feedback
>>>> welcome.
>>>>
>>>> The upcoming support for 2GB partitions will be interesting, but the
>>>> same guidance should cover, I think. Maybe the numeric upper bound might be
>>>> a bit higher since only a single partition is involved, but if processing
>>>> many thousands of rows will remain time consuming, it sounds like that
>>>> should be treated more as a batch-style OLAP operation rather than a
>>>> real-time OLTP operation... I think.
>>>>
>>>
>>> I think this is decent guidance.  I'll also clarify that aggregation
>>> functions should only be used on single partitions if you expect to get a
>>> response back with reasonable latency.  Full table scans are still
>>> expensive, even when they're wrapped in an aggregation function.
>>>
>>> If count(*) is too slow, the standard alternatives are:
>>>  - counters
>>>  - a static count that's periodically refreshed by a batch/background
>>> process
>>>  - LWT increments on an int column
>>>  - an external datastore like redis
>>>
>>> Obviously, each of these has a different set of tradeoffs.
>>>
>>> --
>>> Tyler Hobbs
>>> DataStax <http://datastax.com/>
>>>
>>
>>
>

Re: Proper use of COUNT

Reply via email to