Looks like this guy (Brian Hess) wrote a script to split the token range and 
run count(*) on each subrange:

https://github.com/brianmhess/cassandra-count
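
For illustration, here is a rough Python sketch of the splitting idea. It is not taken from Brian's tool. It assumes the Murmur3Partitioner token range (-2^63 to 2^63-1), and the names split_token_range and count_queries, as well as the keyspace, table and key names, are made up for the example. It only builds the CQL strings; running them is left to whatever driver you use.

```python
# Token bounds for Murmur3Partitioner (an assumption; other partitioners differ).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_range(num_splits):
    """Split the full token ring into num_splits contiguous (start, end) pairs.

    Both ends of each pair are inclusive, so the pairs never overlap and
    together cover every token from MIN_TOKEN to MAX_TOKEN.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN + 1  # number of distinct tokens, 2**64
    bounds = [MIN_TOKEN + (total * i) // num_splits for i in range(num_splits + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_splits)]


def count_queries(keyspace, table, partition_key, num_splits):
    """Build one SELECT COUNT(*) statement per token subrange.

    partition_key is the column list passed to token(), e.g. "id" or "a, b"
    for a composite partition key.
    """
    template = (
        "SELECT COUNT(*) FROM {ks}.{tbl} "
        "WHERE token({pk}) >= {lo} AND token({pk}) <= {hi}"
    )
    return [
        template.format(ks=keyspace, tbl=table, pk=partition_key, lo=lo, hi=hi)
        for lo, hi in split_token_range(num_splits)
    ]


if __name__ == "__main__":
    # Hypothetical keyspace/table/key names, just to show the output shape.
    for query in count_queries("my_keyspace", "my_table", "id", 4):
        print(query)
```

Each statement can then be run separately (in parallel if you like) and the counts summed. More splits mean smaller, faster queries, which helps avoid timeouts on big tables.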

- Max

> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
> 
> SELECT COUNT(*) probably works (with internal paging) on many datasets, given 
> enough time and assuming you don’t have any partitions large enough to kill 
> the query.
> 
> No, it doesn’t count extra replicas / duplicates.
> 
> The old way to do this (before paging / fetch size) was to use manual paging 
> based on tokens/clustering keys:
> 
> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html
> 
> SELECT’s WHERE clause can use token(), which is what you’d want to use to 
> page through the whole token space. 
> 
> You could, in theory, issue thousands of queries in parallel, each for a 
> different token range, and then sum the results. That’s what something like 
> Spark would be doing. If you want to determine rows per node, limit the token 
> range to the one owned by that node. This is easier with a single token per 
> node than with vnodes; with vnodes, repeat it for each of the node’s 
> num_tokens ranges.
