Hi guys.

I've been struggling for the last few days to find a reliable and stable way
to count the keys in a Thrift column family.

My idea is basically to iterate over the whole ring in batches of 10000
records using the token function, as documented here:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html

The only corner case is a single partition holding more than 10000 records
(not the case here, but the program should still handle it). When that
happens, the script explores the partition in depth by fetching all records
for that particular token (see below). In the end, all keys are saved into a
hash to guarantee uniqueness. The problem is that the count of unique keys
differs on every run, seemingly at random: sometimes more keys are retrieved,
sometimes fewer. I'm sure there is no activity going on in that CF.
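
To make the intended iteration concrete, here is a minimal self-contained
sketch of the scheme in Ruby. It runs against a hypothetical in-memory array
(ROWS) instead of Cassandra, and the helpers rows_from, rows_after and
rows_for are illustrative stand-ins for the CQL queries, not driver calls. It
also assumes that once a token has been read in full the scan resumes strictly
after it, which is an assumption of this sketch and not necessarily what the
real script does.

```ruby
require 'set'

# Hypothetical stand-in for the table: [token, key, column1], sorted by token.
# Keys 'c' and 'e' deliberately share token 7 to mimic a token collision.
ROWS = [
  [-50, 'a', 'c1'], [-50, 'a', 'c2'], [-50, 'a', 'c3'],
  [-10, 'b', 'c1'],
  [7, 'c', 'c1'], [7, 'e', 'c1'],
  [42, 'd', 'c1']
].freeze

# Stand-in for: SELECT ... WHERE token(key) >= ? LIMIT ?
def rows_from(token, limit)
  ROWS.select { |t, _k, _c| t >= token }.first(limit)
end

# Stand-in for: SELECT ... WHERE token(key) > ? LIMIT ?
def rows_after(token, limit)
  ROWS.select { |t, _k, _c| t > token }.first(limit)
end

# Stand-in for: SELECT ... WHERE token(key) = ?  (no LIMIT)
def rows_for(token)
  ROWS.select { |t, _k, _c| t == token }
end

def count_unique_keys(batch_size)
  keys = Set.new
  batch = ROWS.first(batch_size) # initial scan from the start of the ring
  until batch.empty?
    batch.each { |_t, key, _c| keys << key }
    token = batch.last[0]
    if batch.first[0] == token
      # The batch never left this token: read the token in full,
      # then resume strictly after it so the loop makes progress.
      rows_for(token).each { |_t, key, _c| keys << key }
      batch = rows_after(token, batch_size)
    else
      # LIMIT may have cut the last token short, so re-read it inclusively.
      batch = rows_from(token, batch_size)
    end
  end
  keys.size
end
```

With this stand-in data, count_unique_keys should return the same number for
any batch size, including a batch size smaller than the widest partition.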

I'm running Cassandra 2.1.11 with the Murmur3 partitioner, RF=3 and CL=QUORUM.

The column family structure is:

CREATE TABLE tbl (
    key blob,
    column1 ascii,
    value blob,
    PRIMARY KEY(key, column1)
)

and I'm running the following script

connection = open_cql_connection
results = connection.execute("SELECT token(key), key FROM tbl LIMIT 10000")

keys_hash = {} # Hash to save the keys to guarantee uniqueness
last_token = nil
token = nil

while results != nil
  results.each do |row|
    keys_hash[row['key']] = true
    token = row['token(key)']
  end
  if token == last_token
    results = connection.execute("SELECT token(key), key FROM tbl WHERE token(key) = #{token}")
  else
    results = connection.execute("SELECT token(key), key FROM tbl WHERE token(key) >= #{token} LIMIT 10000")
  end
  last_token = token
end

puts keys_hash.keys.count

What am I missing?

Thanks!

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
