Thanks Christophe, we will definitely consider that in the future.

On Mon, Aug 21, 2017 at 3:01 PM, Christophe Schmitz <christo...@instaclustr.com> wrote:
Hi Avi,

The Spark project documentation is quite good, as is the spark-cassandra-connector GitHub project, which contains some basic examples you can easily take inspiration from. A few pieces of advice you might find useful:

- You will want one Spark worker on each node, and a Spark master on either one of the nodes or on a separate node.
- Pay close attention to your port configuration (firewall), as the Spark error log does not always give you the right hint.
- Pay close attention to your heap sizes. Make sure Cassandra heap size + Spark heap size < your node's memory (taking into account Cassandra off-heap usage if enabled, the OS, etc.).
- If your Cassandra data center is used in production, make sure you throttle reads and writes from Spark, watch your latencies, and consider a separate analytics Cassandra data center if you get serious with Spark.
- More or less everyone I know finds that writing Spark jobs in Scala is natural, while writing them in Java is painful :D

Getting Spark running will be a bit of an investment at the beginning, but overall you will find it lets you run queries you can't naturally run in Cassandra, like the one you described.

Cheers,

Christophe

On 21 August 2017 at 16:16, Avi Levi <a...@indeni.com> wrote:

Thanks Christophe,
we didn't want to add too many moving parts, but it sounds like a good solution. Do you have any reference/link that I can look at?

Cheers
Avi

On Mon, Aug 21, 2017 at 3:43 AM, Christophe Schmitz <christo...@instaclustr.com> wrote:

Hi Avi,

Have you thought of using Spark for that work? If you collocate the Spark workers on the Cassandra nodes, the spark-cassandra-connector will automatically split the token range for you in such a way that each Spark worker only hits the local Cassandra node. This will also be done in parallel, so it should be much faster that way.

Cheers,
Christophe

On 21 August 2017 at 01:34, Avi Levi <a...@indeni.com> wrote:

Thank you very much, one question. You wrote that I do not need DISTINCT here since it's part of the primary key, but only the combination is unique (PRIMARY KEY (id, timestamp)). Also, if I take the last token and feed it back as you showed, wouldn't I get overlapping boundaries?

On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens <migh...@gmail.com> wrote:

You should be able to fairly efficiently iterate all the partition keys like:

    select id, token(id) from table where token(id) >= -9204925292781066255 limit 1000;

     id                                         | system.token(id)
    --------------------------------------------+----------------------
    ...
     0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686

Take the last token you receive and feed it back in, skipping duplicates from the previous page (on the unlikely chance that you have two IDs with a token collision on the page boundary):

    select id, token(id) from table where token(id) >= -7821793584824523686 limit 1000;

     id                                         | system.token(id)
    --------------------------------------------+---------------------
    ...
     0xc6289d729c9087fb5a1fe624b0b883ab82a9bffe | -434806781044590339

Continue until you have no more results.
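A minimal sketch of that pagination loop, for illustration (not from the thread: it assumes the DataStax Java driver 3.x called from Scala, and the keyspace name my_keyspace and contact point are placeholders):

    import com.datastax.driver.core.Cluster
    import scala.collection.JavaConverters._

    object IteratePartitionKeys {
      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build() // placeholder host
        val session = cluster.connect("my_keyspace")                         // placeholder keyspace
        val stmt = session.prepare(
          "SELECT id, token(id) AS t FROM my_table WHERE token(id) >= ? LIMIT 1000")

        var lastToken   = Long.MinValue       // smallest Murmur3 token: the start of the ring
        var boundaryIds = Set.empty[String]   // ids already seen at the previous page boundary
        var morePages   = true

        while (morePages) {
          val rows = session.execute(stmt.bind(lastToken: java.lang.Long)).all().asScala
          // Skip rows re-fetched from the previous page boundary (token collisions).
          rows.filterNot(r => r.getLong("t") == lastToken && boundaryIds(r.getString("id")))
              .foreach(r => println(r.getString("id")))     // process each partition key

          if (rows.size < 1000) morePages = false           // short page: end of the ring
          else {
            lastToken   = rows.last.getLong("t")
            boundaryIds = rows.filter(_.getLong("t") == lastToken).map(_.getString("id")).toSet
          }
        }
        cluster.close()
      }
    }

Starting from Long.MinValue covers the whole Murmur3 ring, and carrying the boundary ids forward handles the token-collision caveat described above.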
You don't really need DISTINCT here: it's part of your primary key, so it must already be distinct.

If you want to parallelize it, split the ring into n ranges and include each range's upper bound in its query:

    select id, token(id) from table where token(id) >= -9204925292781066255 AND token(id) < $rangeUpperBound limit 1000;

On Sun, Aug 20, 2017 at 12:33 AM Avi Levi <a...@indeni.com> wrote:

I need to get all unique keys (not the complete primary key, just the partition key) in order to aggregate all the relevant records of that key and apply some calculations on them.

    CREATE TABLE my_table (
        id text,
        timestamp bigint,
        value double,
        PRIMARY KEY (id, timestamp)
    );

I know that a query like this:

    SELECT DISTINCT id FROM my_table;

is not very efficient, but how about the approach presented here
<http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/>,
sending queries in parallel and using the token:

    SELECT DISTINCT id FROM my_table WHERE token(id) >= -9223372036854775808 AND token(id) <= -9204925292781066255;

Or I can just maintain another table with only the unique keys:

    CREATE TABLE id_only (
        id text,
        PRIMARY KEY (id)
    );

but I tend not to, since it is error prone and will require additional procedures to maintain data integrity between the two tables.

Any ideas?

Thanks

Avi

--

Christophe Schmitz
Director of Consulting EMEA
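As a footnote to Christophe's Spark suggestion above, a minimal sketch of such a job (an illustration only: it assumes the spark-cassandra-connector, the my_keyspace keyspace name, contact point, aggregation, and output path are all placeholders):

    import com.datastax.spark.connector._            // adds cassandraTable to SparkContext
    import org.apache.spark.{SparkConf, SparkContext}

    object UniqueKeysJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("unique-partition-keys")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
        val sc = new SparkContext(conf)

        // The connector splits the token ring so that each collocated worker
        // reads only the ranges owned by its local Cassandra node.
        val perKey = sc.cassandraTable("my_keyspace", "my_table")
          .select("id", "value")
          .map(row => (row.getString("id"), row.getDouble("value")))
          .reduceByKey(_ + _)      // placeholder aggregation over each partition key

        perKey.saveAsTextFile("hdfs:///tmp/per_key_sums") // placeholder output path
        sc.stop()
      }
    }

With a worker collocated on each Cassandra node, the full table scan runs in parallel and node-locally, which is what makes this approach faster than paging through the ring from a single client.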