Hi Avi,

Have you thought of using Spark for that work? If you co-locate the Spark workers on each Cassandra node, the spark-cassandra-connector will automatically split the token ranges for you in such a way that each Spark worker only hits the local Cassandra node, and the reads happen in parallel. It should be much faster that way.
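Something like this minimal sketch, assuming the DataStax spark-cassandra-connector (the keyspace name, contact point, and the sum are placeholders; the table is the my_table schema from your message below):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("per-id-aggregation")
  .set("spark.cassandra.connection.host", "127.0.0.1") // any contact point
val sc = new SparkContext(conf)

// The connector turns token ranges into Spark partitions, so with workers
// co-located on the Cassandra nodes each split is read from the local node.
sc.cassandraTable[(String, Long, Double)]("my_ks", "my_table")
  .select("id", "timestamp", "value")
  // spanBy groups each Cassandra partition's rows without a shuffle,
  // since the connector returns a partition's rows contiguously.
  .spanBy { case (id, _, _) => id }
  .map { case (id, rows) => (id, rows.map(_._3).sum) } // placeholder calculation
  .saveAsTextFile("per-id-results")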
Cheers,

Christophe

On 21 August 2017 at 01:34, Avi Levi <a...@indeni.com> wrote:

> Thank you very much. One question: you wrote that I do not need DISTINCT
> here since it is part of the primary key, but only the combination is
> unique (PRIMARY KEY (id, timestamp)). Also, if I take the last token and
> feed it back as you showed, wouldn't I get overlapping boundaries?
>
> On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens <migh...@gmail.com> wrote:
>
>> You should be able to iterate all the partition keys fairly efficiently
>> like this:
>>
>> select id, token(id) from table where token(id) >= -9204925292781066255
>> limit 1000;
>>
>>  id                                         | system.token(id)
>> --------------------------------------------+----------------------
>>  ...
>>  0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686
>>
>> Take the last token you receive and feed it back in, skipping duplicates
>> from the previous page (on the unlikely chance that you have two IDs
>> with a token collision on the page boundary):
>>
>> select id, token(id) from table where token(id) >= -7821793584824523686
>> limit 1000;
>>
>>  id                                         | system.token(id)
>> --------------------------------------------+---------------------
>>  ...
>>  0xc6289d729c9087fb5a1fe624b0b883ab82a9bffe | -434806781044590339
>>
>> Continue until you have no more results. You don't really need DISTINCT
>> here: it is part of your primary key, so it must already be distinct.
>>
>> If you want to parallelize it, split the ring into n ranges and include
>> each range's end as an upper bound for its segment:
>>
>> select id, token(id) from table where token(id) >= -9204925292781066255
>> AND token(id) < $rangeUpperBound limit 1000;
>>
>> On Sun, Aug 20, 2017 at 12:33 AM Avi Levi <a...@indeni.com> wrote:
>>
>>> I need to get all unique keys (not the complete primary key, just the
>>> partition key) in order to aggregate all the relevant records of each
>>> key and apply some calculations to them.
>>>
>>> CREATE TABLE my_table (
>>>     id text,
>>>     timestamp bigint,
>>>     value double,
>>>     PRIMARY KEY (id, timestamp)
>>> );
>>>
>>> I know that a query like
>>>
>>> SELECT DISTINCT id FROM my_table;
>>>
>>> is not very efficient, but how about the approach presented here
>>> <http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/>,
>>> sending queries in parallel and using the token:
>>>
>>> SELECT DISTINCT id FROM my_table WHERE token(id) >= -9223372036854775808
>>> AND token(id) <= -9204925292781066255;
>>>
>>> Or I could just maintain another table with the unique keys:
>>>
>>> CREATE TABLE id_only (
>>>     id text,
>>>     PRIMARY KEY (id)
>>> );
>>>
>>> but I would rather not, since it is error prone and forces other
>>> procedures to maintain data integrity between the two tables.
>>>
>>> Any ideas?
>>>
>>> Thanks,
>>> Avi

-- 
Christophe Schmitz
Director of Consulting EMEA
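For reference, a driver-side sketch of Eric's paging loop, in Scala with the DataStax Java driver 3.x. The contact point, keyspace name, and process step are placeholders; it uses SELECT DISTINCT so each id comes back once per page, and it remembers the ids seen at the boundary token so the next page (which starts at that same token) can skip them, which also addresses Avi's overlapping-boundaries question:

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_ks") // placeholder keyspace

val pageSize = 1000
var boundaryToken = Long.MinValue // minimum Murmur3 token
var boundaryIds = Set.empty[String]
var finished = false

def process(id: String): Unit = println(id) // placeholder per-id work

while (!finished) {
  val page = session.execute(
    s"SELECT DISTINCT id, token(id) FROM my_table " +
    s"WHERE token(id) >= $boundaryToken LIMIT $pageSize").all().asScala

  // Skip ids already returned at the boundary token on the previous page.
  page.filterNot(r => r.getLong(1) == boundaryToken && boundaryIds(r.getString(0)))
      .foreach(r => process(r.getString(0)))

  if (page.size < pageSize) finished = true
  else {
    val newBoundary = page.last.getLong(1)
    val atBoundary = page.collect {
      case r if r.getLong(1) == newBoundary => r.getString(0)
    }.toSet
    // Accumulate if the whole page shared one token, so nothing is lost.
    boundaryIds =
      if (newBoundary == boundaryToken) boundaryIds ++ atBoundary else atBoundary
    boundaryToken = newBoundary
  }
}

session.close()
cluster.close()

To parallelize as Eric suggests, split [Long.MinValue, Long.MaxValue] into n segments and run one such loop per segment with "AND token(id) < segmentUpperBound" added to the query.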