Hi Carlos,

In CQL, for the corner case you describe, you could simply do

    SELECT * FROM tbl WHERE key=#{key} LIMIT 1000;

and if it returns 1000 items, you'd iteratively do

    SELECT * FROM tbl WHERE key=#{key} AND column1 > #{last_col1_in_prev_query} LIMIT 1000;
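
In Ruby, the loop above could look roughly like the sketch below. Treat it as an illustration only: `each_row_in_partition` and the `execute` callable are hypothetical names of mine (not driver API), `execute` is assumed to take a CQL string and return an array of row hashes, and `key_literal` is assumed to already be a valid blob literal such as 0x0a1b.

```ruby
# Sketch: page through one partition by restarting after the last
# clustering column (column1) seen on the previous page.
# `execute` is a hypothetical callable: CQL string in, array of row hashes out.
def each_row_in_partition(execute, key_literal, page_size = 1000)
  last_col = nil
  loop do
    cql = "SELECT * FROM tbl WHERE key=#{key_literal}"
    if last_col
      # column1 is ascii, so quote it and double any embedded single quotes.
      escaped = last_col.gsub("'", "''")
      cql += " AND column1 > '#{escaped}'"
    end
    cql += " LIMIT #{page_size}"

    rows = execute.call(cql).to_a
    rows.each { |row| yield row }

    # A short page means the partition is exhausted.
    break if rows.size < page_size
    last_col = rows.last['column1']
  end
end
```

With a real connection you would pass something like `->(cql) { connection.execute(cql) }` as `execute`. If the partition holds an exact multiple of the page size, the loop issues one extra query that comes back empty, which is harmless.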

Also, have a look at fetchSize here:
https://docs.datastax.com/en/developer/java-driver/2.0/java-driver/reference/queryBuilderOverview.html?scroll=queryBuilderOverview__setting-query-options-querybuilder-api
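
That link is for the Java driver. As far as I know, the DataStax Ruby driver offers the same thing through a `page_size` option on `execute`, and its result objects respond to `last_page?` and `next_page`. Assuming that is the driver you use, a minimal sketch of walking all pages could be (`each_row_across_pages` is a hypothetical helper name):

```ruby
# Sketch: visit every row of a driver-paged result, one page at a time.
# Assumes `result` responds to each, last_page? and next_page.
def each_row_across_pages(result)
  loop do
    result.each { |row| yield row }
    break if result.last_page?
    result = result.next_page
  end
end

# Usage against a live cluster (not run here; `session` is assumed to be
# an open driver session):
#
#   keys = {}
#   first_page = session.execute("SELECT key FROM tbl", page_size: 10_000)
#   each_row_across_pages(first_page) { |row| keys[row['key']] = true }
#   puts keys.size
```

This way the driver keeps track of the paging state, so you would not need the token() bookkeeping at all.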

Hope this helps.

Cheers,
Jens

On Thu, Apr 21, 2016 at 5:59 PM Carlos Alonso <i...@mrcalonso.com> wrote:

> Hi guys.
>
> I've been struggling for the last few days to find a reliable and stable way
> to count keys in a Thrift column family.
>
> My idea is to basically iterate the whole ring using the token function,
> as documented here:
> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html in
> batches of 10000 records
>
> The only corner case is that if there were more than 10000 records in a
> single partition (not the case, but the program should still handle it) it
> explores the partition in depth by getting all records for that particular
> token (see below). In the end, all keys are saved into a hash to guarantee
> uniqueness. The count of unique keys differs on every run (sometimes more
> keys are retrieved, sometimes fewer) and, of course, I'm sure no activity is
> going on in that cf.
>
> I'm running Cassandra 2.1.11 with the Murmur3 partitioner, RF=3 and CL=QUORUM.
>
> the column family structure is
>
> CREATE TABLE tbl (
>     key blob,
>     column1 ascii,
>     value blob,
>     PRIMARY KEY(key, column1)
> )
>
> and I'm running the following script
>
> connection = open_cql_connection
> results = connection.execute("SELECT token(key), key FROM tbl LIMIT 10000")
>
> keys_hash = {} # Hash to save the keys to guarantee uniqueness
> last_token = nil
> token = nil
>
> while results != nil
>   results.each do |row|
>     keys_hash[row['key']] = true
>     token = row['token(key)']
>   end
>   if token == last_token
>     results = connection.execute("SELECT token(key), key FROM tbl WHERE token(key) = #{token}")
>   else
>     results = connection.execute("SELECT token(key), key FROM tbl WHERE token(key) >= #{token} LIMIT 10000")
>   end
>   last_token = token
> end
>
> puts keys_hash.keys.count
>
> What am I missing?
>
> Thanks!
>
> Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
>
-- 

Jens Rantil
Backend Developer @ Tink

Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
For urgent matters you can reach me at +46-708-84 18 32.
