Thanks Christophe, we didn't want to add too many moving parts, but it sounds like a good solution. Do you have any reference / link that I can look at?
Cheers
Avi

On Mon, Aug 21, 2017 at 3:43 AM, Christophe Schmitz <christo...@instaclustr.com> wrote:

> Hi Avi,
>
> Have you thought of using Spark for that work? If you collocate the Spark
> workers on each Cassandra node, the spark-cassandra connector will
> automatically split the token range for you, in such a way that each Spark
> worker only hits the local Cassandra node. This will also be done in
> parallel. Should be much faster that way.
>
> Cheers,
> Christophe
>
> On 21 August 2017 at 01:34, Avi Levi <a...@indeni.com> wrote:
>
>> Thank you very much, one question. You wrote that I do not need
>> DISTINCT here since it's part of the primary key, but only the
>> combination is unique (PRIMARY KEY (id, timestamp)). Also, if I take the
>> last token and feed it back as you showed, wouldn't I get overlapping
>> boundaries?
>>
>> On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens <migh...@gmail.com> wrote:
>>
>>> You should be able to fairly efficiently iterate all the partition keys
>>> like:
>>>
>>>   select id, token(id) from table where token(id) >= -9204925292781066255
>>>   limit 1000;
>>>
>>>    id                                         | system.token(id)
>>>   --------------------------------------------+----------------------
>>>   ...
>>>    0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686
>>>
>>> Take the last token you receive and feed it back in, skipping duplicates
>>> from the previous page (on the unlikely chance that you have two IDs with
>>> a token collision on the page boundary):
>>>
>>>   select id, token(id) from table where token(id) >= -7821793584824523686
>>>   limit 1000;
>>>
>>>    id                                         | system.token(id)
>>>   --------------------------------------------+---------------------
>>>   ...
>>>    0xc6289d729c9087fb5a1fe624b0b883ab82a9bffe | -434806781044590339
>>>
>>> Continue until you have no more results. You don't really need DISTINCT
>>> here: it's part of your primary key, it must already be distinct.
>>>
>>> If you want to parallelize it, split the ring into n ranges and include
>>> an upper bound for each segment:
>>>
>>>   select id, token(id) from table where token(id) >= -9204925292781066255
>>>   AND token(id) < $rangeUpperBound limit 1000;
>>>
>>> On Sun, Aug 20, 2017 at 12:33 AM Avi Levi <a...@indeni.com> wrote:
>>>
>>>> I need to get all unique keys (not the complete primary key, just the
>>>> partition key) in order to aggregate all the relevant records of that
>>>> key and apply some calculations on them.
>>>>
>>>>   CREATE TABLE my_table (
>>>>       id text,
>>>>       timestamp bigint,
>>>>       value double,
>>>>       PRIMARY KEY (id, timestamp)
>>>>   );
>>>>
>>>> I know that a query like
>>>>
>>>>   SELECT DISTINCT id FROM my_table;
>>>>
>>>> is not very efficient, but how about the approach presented here
>>>> <http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/>,
>>>> sending queries in parallel and using the token:
>>>>
>>>>   SELECT DISTINCT id FROM my_table WHERE token(id) >= -9204925292781066255
>>>>   AND token(id) <= -9223372036854775808;
>>>>
>>>> Or I can just maintain another table with the unique keys:
>>>>
>>>>   CREATE TABLE id_only (
>>>>       id text,
>>>>       PRIMARY KEY (id)
>>>>   );
>>>>
>>>> but I tend not to, since it is error prone and will force other
>>>> procedures to maintain data integrity between those two tables.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks
>>>> Avi

--
Christophe Schmitz
Director of consulting EMEA
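
A minimal sketch of the co-located Spark scan Christophe describes, in Scala
against the spark-cassandra connector. The keyspace name "my_keyspace" and the
contact point are assumptions to adapt; the table matches the schema quoted
above.

  import com.datastax.spark.connector._
  import org.apache.spark.{SparkConf, SparkContext}

  object DistinctIds {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("distinct-ids")
        // Any contact node works; with workers co-located on the Cassandra
        // nodes, the connector reads each token range from the local replica.
        .set("spark.cassandra.connection.host", "127.0.0.1")
      val sc = new SparkContext(conf)

      // The connector splits the token ring into Spark partitions for you,
      // so the whole table is scanned in parallel with no manual token math.
      val ids = sc.cassandraTable("my_keyspace", "my_table")
        .select("id")                  // pull only the partition key
        .map(_.getString("id"))
        .distinct()                    // one entry per partition key

      ids.collect().foreach(println)   // or feed straight into the aggregation
      sc.stop()
    }
  }

The per-key calculations can then be a reduceByKey over (id, value) pairs in
the same job, instead of a second round-trip per key.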
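For the driver-only variant, a sketch of Eric's paging loop (DataStax Java
driver 3.x from Scala; "my_keyspace" and Murmur3Partitioner are assumptions).
Two details bear on the follow-up question: the sketch keeps DISTINCT, since a
plain SELECT returns one row per (id, timestamp) clustering row; and pages do
overlap at the boundary token when restarting with >=, which is exactly why
the already-seen ids are skipped.

  import com.datastax.driver.core.Cluster
  import scala.collection.JavaConverters._

  object IterateIds {
    def main(args: Array[String]): Unit = {
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
      val session = cluster.connect("my_keyspace")

      val pageSize = 1000
      var lower = Long.MinValue            // the Murmur3 ring starts here
      var seenAtBoundary = Set.empty[String]
      var done = false

      while (!done) {
        val rows = session.execute(
          s"SELECT DISTINCT id, token(id) FROM my_table " +
          s"WHERE token(id) >= $lower LIMIT $pageSize").all().asScala

        // Pages overlap at the boundary token, so drop ids already handled.
        rows.filterNot(r => seenAtBoundary(r.getString("id")))
            .foreach(r => println(r.getString("id")))   // aggregate here

        if (rows.size < pageSize) done = true
        else {
          lower = rows.last.getLong(1)
          // ids sharing the boundary token reappear on the next page; assumes
          // fewer than pageSize partitions ever collide on a single token.
          seenAtBoundary = rows.filter(_.getLong(1) == lower)
                               .map(_.getString("id")).toSet
        }
      }
      cluster.close()
    }
  }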
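And to parallelize it as Eric suggests, a small helper that carves the full
Murmur3 ring into n contiguous segments; each worker runs the loop above with
its segment's bounds substituted for the lower bound and $rangeUpperBound:

  object TokenRing {
    // n (lower, upper) pairs covering [Long.MinValue, Long.MaxValue].
    // Segments are half-open, matching "token(id) >= lower AND token(id) < upper";
    // the last segment's query should use <= so Long.MaxValue itself is included.
    def split(n: Int): Seq[(Long, Long)] = {
      val width = (BigInt(Long.MaxValue) - BigInt(Long.MinValue)) / n
      (0 until n).map { i =>
        val lower = (BigInt(Long.MinValue) + width * i).toLong
        val upper = if (i == n - 1) Long.MaxValue
                    else (BigInt(Long.MinValue) + width * (i + 1)).toLong
        (lower, upper)
      }
    }
  }

Because the segments never overlap, the boundary-duplicate skipping is only
needed within a segment, not across workers.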