Based on a recent inquiry and a recent thread of my own, and the coming
support for wide rows, I'll focus in on this question that I feel needs
better documentation of recommended best practice:

When can the COUNT(*) aggregate row-counting function be used?

Accept for relatively small or narrow queries, it seems to have a
propensity for timing out.

>From what I gather (and without specific testing myself), SELECT COUNT(*)
seems to be warranted for use cases where:

1. The table is relatively small, or
2. The number of rows selected by the WHERE clause is relatively small, due
to a combination of a relatively small number of partitions and a
relatively narrow slice of rows in a partition.

One can always approximate a larger SELECT by performing multiple selects,
each specifying a limited range of tokens in the WHERE clause.

But, even if the number of partitions selected is relatively small, it is
possible that the partitions might be wide so that the number of rows is
not so small.

And with CASSANDRA-11206
<https://issues.apache.org/jira/browse/CASSANDRA-11206>, very large
partitions (2 GB) could have a non-trivial number of rows.

So, an open question is how small is reasonable for COUNT. Thousands? Tens
of thousands? Only Hundreds?

Obviously it does depend on your particular data and hardware, but just on
a general basis.

In any case, does anyone have any specific experiences to share as to what
recommendations they would have for the use of COUNT?

A companion question is whether COUNT(column_name) has the same limitations
and recommendations. It does have to actually fetch the column values as
opposed to simply determining the existence of the row, but how
consequential that additional processing is, I couldn't say.

-- Jack Krupansky

Reply via email to