Thanks Christophe,
we didn't want to add too many moving parts, but it sounds like a good
solution. Do you have any reference / link that I can look at?

Cheers
Avi

On Mon, Aug 21, 2017 at 3:43 AM, Christophe Schmitz <
christo...@instaclustr.com> wrote:

> Hi Avi,
>
> Have you thought of using Spark for that work? If you colocate the Spark
> workers on each Cassandra node, the spark-cassandra connector will
> automatically split the token range for you in such a way that each Spark
> worker only hits its local Cassandra node. This will also be done in
> parallel, so it should be much faster that way.
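>
> A minimal sketch of that approach in Scala (the keyspace name my_keyspace
> and the contact point are assumptions, and it presumes the DataStax
> spark-cassandra-connector on the classpath):
>
> import org.apache.spark.{SparkConf, SparkContext}
> import com.datastax.spark.connector._
>
> object DistinctIds {
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf()
>       .setAppName("distinct-ids")
>       .set("spark.cassandra.connection.host", "127.0.0.1")
>     val sc = new SparkContext(conf)
>     // The connector splits the scan by token range and routes each
>     // split to a worker colocated with that range's data.
>     val ids = sc.cassandraTable("my_keyspace", "my_table")
>       .select("id")
>       .map(_.getString("id"))
>       .distinct()
>     ids.foreach(println)
>     sc.stop()
>   }
> }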
>
> Cheers,
> Christophe
>
>
> On 21 August 2017 at 01:34, Avi Levi <a...@indeni.com> wrote:
>
>> Thank you very much. One question: you wrote that I do not need
>> DISTINCT here since it's part of the primary key, but only the
>> combination is unique (PRIMARY KEY (id, timestamp)). Also, if I take
>> the last token and feed it back in as you showed, wouldn't I get
>> overlapping boundaries?
>>
>> On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens <migh...@gmail.com> wrote:
>>
>>> You should be able to fairly efficiently iterate all the partition keys
>>> like:
>>>
>>> select id, token(id) from table where token(id) >= -9204925292781066255
>>> limit 1000;
>>>  id                                         | system.token(id)
>>> --------------------------------------------+----------------------
>>> ...
>>>  0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686
>>>
>>> Take the last token you receive and feed it back in, skipping duplicates
>>> from the previous page (on the unlikely chance that you have two IDs with
>>> a token collision on the page boundary):
>>>
>>> select id, token(id) from table where token(id) >=
>>> -7821793584824523686 limit 1000;
>>>  id                                         | system.token(id)
>>> --------------------------------------------+---------------------
>>> ...
>>>  0xc6289d729c9087fb5a1fe624b0b883ab82a9bffe | -434806781044590339
>>>
>>> Continue until you have no more results.  You don't really need DISTINCT
>>> here: it's part of your primary key, so it must already be distinct.
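>>>
>>> A minimal sketch of that loop in Scala with the DataStax Java driver
>>> (my_keyspace and the contact point are assumptions; I use SELECT DISTINCT
>>> so each partition key comes back once per page):
>>>
>>> import com.datastax.driver.core.Cluster
>>> import scala.collection.JavaConverters._
>>>
>>> val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
>>> val session = cluster.connect("my_keyspace")
>>>
>>> var lastToken = Long.MinValue           // the ring starts at the minimum token
>>> var seenAtBoundary = Set.empty[String]  // ids already emitted with token == lastToken
>>> var done = false
>>> while (!done) {
>>>   val page = session.execute(
>>>     s"SELECT DISTINCT id, token(id) AS t FROM my_table " +
>>>     s"WHERE token(id) >= $lastToken LIMIT 1000"
>>>   ).all().asScala.toList
>>>   // skip ids repeated from the previous page boundary
>>>   page.filterNot(r => seenAtBoundary(r.getString("id")))
>>>       .foreach(r => println(r.getString("id")))
>>>   if (page.size < 1000) done = true
>>>   else {
>>>     lastToken = page.last.getLong("t")
>>>     seenAtBoundary = page.filter(_.getLong("t") == lastToken)
>>>                          .map(_.getString("id")).toSet
>>>   }
>>> }
>>> cluster.close()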
>>>
>>> If you want to parallelize it, split the ring into *n* ranges and
>>> include the segment's end as an upper bound in each query:
>>>
>>> select id, token(id) from table where token(id) >= -9204925292781066255
>>> AND token(id) < $rangeUpperBound limit 1000;
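>>>
>>> To compute those segment bounds, here is a small hypothetical helper in
>>> Scala that splits the full Murmur3 ring (Long.MinValue to Long.MaxValue)
>>> into n contiguous ranges with inclusive bounds (so use <= on the upper
>>> bound rather than <):
>>>
>>> def ringRanges(n: Int): Seq[(Long, Long)] = {
>>>   val width = BigInt(2).pow(64)  // total number of tokens on the ring
>>>   (0 until n).map { i =>
>>>     // BigInt arithmetic avoids Long overflow at the ring edges
>>>     val lo = BigInt(Long.MinValue) + width * i / n
>>>     val hi = BigInt(Long.MinValue) + width * (i + 1) / n - 1
>>>     (lo.toLong, hi.toLong)  // scan: token(id) >= lo AND token(id) <= hi
>>>   }
>>> }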
>>>
>>>
>>> On Sun, Aug 20, 2017 at 12:33 AM Avi Levi <a...@indeni.com> wrote:
>>>
>>>> I need to get all the unique keys (not the complete primary key, just the
>>>> partition key) in order to aggregate all the relevant records for each key
>>>> and apply some calculations to them.
>>>>
>>>> CREATE TABLE my_table (
>>>>     id text,
>>>>     timestamp bigint,
>>>>     value double,
>>>>     PRIMARY KEY (id, timestamp)
>>>> );
>>>>
>>>> I know that a query like this
>>>>
>>>> SELECT DISTINCT id FROM my_table
>>>>
>>>> is not very efficient, but how about the approach presented here
>>>> <http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/>,
>>>> sending queries in parallel and using the token:
>>>>
>>>> SELECT DISTINCT id FROM my_table WHERE token(id) >= -9223372036854775808
>>>> AND token(id) <= -9204925292781066255;
>>>>
>>>> Or I could just maintain another table with only the unique keys:
>>>>
>>>> CREATE TABLE id_only (
>>>>     id text,
>>>>     PRIMARY KEY (id)
>>>> );
>>>>
>>>> but I tend not to, since it is error-prone and would force other
>>>> procedures to maintain data integrity between those two tables.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks
>>>>
>>>> Avi
>>>>
>>>>
>>
>
>
> --
>
>
> Christophe Schmitz
> Director of Consulting EMEA
>
