DataStax Enterprise bundles Spark and the Spark connector on the DSE nodes and handles much of the plumbing work (monitoring, etc.). Worth a look.

Sean Durity
From: Avi Levi [mailto:a...@indeni.com]
Sent: Tuesday, August 22, 2017 2:46 AM
To: user@cassandra.apache.org
Subject: Re: Getting all unique keys

Thanks Christophe, we will definitely consider that in the future.

On Mon, Aug 21, 2017 at 3:01 PM, Christophe Schmitz <christo...@instaclustr.com> wrote:

Hi Avi,

The Spark project documentation is quite good, as is the spark-cassandra-connector GitHub project, which contains some basic examples you can easily get inspired from. A few random pieces of advice you might find useful:

- You will want one Spark worker on each node, and a Spark master on either one of the nodes or on a separate node.
- Pay close attention to your port configuration (firewall), as the Spark error log does not always give you the right hint.
- Pay close attention to your heap sizes. Make sure to configure them such that Cassandra heap size + Spark heap size < your node's memory (taking into account Cassandra off-heap usage if enabled, the OS, etc.).
- If your Cassandra data center is used in production, make sure you throttle reads/writes from Spark, pay attention to your latencies, and consider using a separate analytics Cassandra data center if you get serious with Spark.
- More or less everyone I know finds that writing Spark jobs in Scala is natural, while writing them in Java is painful :D

Getting Spark running will be a bit of an investment at the beginning, but overall you will find it lets you run queries you can't naturally run in Cassandra, like the one you described (a sketch of what such a job can look like follows below).

Cheers,
Christophe
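For concreteness, here is a minimal sketch of the kind of job Christophe describes, written against the spark-cassandra-connector Scala API. The contact point, the my_ks keyspace name, the throttle value, and the sum-per-id aggregation are illustrative assumptions, not details from this thread; my_table is the table from Avi's original question further down.

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object PerKeyAggregation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("per-key-aggregation")
      // Assumed contact point; with workers collocated on the Cassandra
      // nodes, the connector keeps reads node-local.
      .set("spark.cassandra.connection.host", "127.0.0.1")
      // Read throttle (reads/s per core) so the scan does not hurt
      // production latencies -- the value is a placeholder to tune.
      .set("spark.cassandra.input.reads_per_sec", "1000")

    val sc = new SparkContext(conf)

    // The connector splits the scan by token range and assigns each split
    // to the worker holding that data, so the full scan runs in parallel
    // with local reads.
    val totals = sc.cassandraTable[(String, Double)]("my_ks", "my_table")
      .select("id", "value")
      .reduceByKey(_ + _) // stand-in aggregation: sum of value per id

    totals.collect().foreach { case (id, total) => println(s"$id -> $total") }
    sc.stop()
  }
}

Packaged into a jar and launched with spark-submit against the collocated master, this replaces the manual token-range bookkeeping discussed below.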
On 21 August 2017 at 16:16, Avi Levi <a...@indeni.com> wrote:

Thanks Christophe, we didn't want to add too many moving parts, but it sounds like a good solution. Do you have any reference/link that I can look at?

Cheers
Avi

On Mon, Aug 21, 2017 at 3:43 AM, Christophe Schmitz <christo...@instaclustr.com> wrote:

Hi Avi,

Have you thought of using Spark for that work? If you collocate the Spark workers on each Cassandra node, the spark-cassandra-connector will automatically split the token range for you in such a way that each Spark worker only hits the local Cassandra node. This will also be done in parallel. It should be much faster that way.

Cheers,
Christophe

On 21 August 2017 at 01:34, Avi Levi <a...@indeni.com> wrote:

Thank you very much, one question. You wrote that I do not need DISTINCT here since it's part of the primary key, but only the combination is unique (PRIMARY KEY (id, timestamp)). Also, if I take the last token and feed it back as you showed, wouldn't I get overlapping boundaries?

On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens <migh...@gmail.com> wrote:

You should be able to fairly efficiently iterate all the partition keys like:

select id, token(id) from table where token(id) >= -9204925292781066255 limit 1000;

 id                                         | system.token(id)
--------------------------------------------+----------------------
 ...
 0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686

Take the last token you receive and feed it back in, skipping duplicates from the previous page (on the unlikely chance that you have two IDs with a token collision on the page boundary):

select id, token(id) from table where token(id) >= -7821793584824523686 limit 1000;

 id                                         | system.token(id)
--------------------------------------------+---------------------
 ...
 0xc6289d729c9087fb5a1fe624b0b883ab82a9bffe | -434806781044590339

Continue until you have no more results. You don't really need DISTINCT here: it's part of your primary key, so it must already be distinct.

If you want to parallelize it, split the ring into n ranges and include each range's end as an upper bound for its segment:

select id, token(id) from table where token(id) >= -9204925292781066255 AND token(id) < $rangeUpperBound limit 1000;

On Sun, Aug 20, 2017 at 12:33 AM Avi Levi <a...@indeni.com> wrote:

I need to get all unique keys (not the complete primary key, just the partition key) in order to aggregate all the relevant records of each key and apply some calculations on them.

CREATE TABLE my_table (
    id text,
    timestamp bigint,
    value double,
    PRIMARY KEY (id, timestamp)
)

I know that a query like

SELECT DISTINCT id FROM my_table

is not very efficient, but how about the approach presented here <http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/>: sending queries in parallel and using the token, e.g.

SELECT DISTINCT id FROM my_table WHERE token(id) >= -9223372036854775808 AND token(id) <= -9204925292781066255;

Or I can just maintain another table with the unique keys:

CREATE TABLE id_only (
    id text,
    PRIMARY KEY (id)
)

but I tend not to, since it is error prone and will require other procedures to maintain data integrity between those two tables. Any ideas?

Thanks
Avi

--
Christophe Schmitz
Director of consulting EMEA
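To tie Eric's pagination recipe back to Avi's two follow-up questions, here is a hedged sketch of the paging loop in Scala against the DataStax Java driver (3.x era); the contact point, keyspace, and page size are assumptions. It uses SELECT DISTINCT id so each returned row is a partition key rather than one row per (id, timestamp), and it remembers the ids seen at the boundary token, so the deliberate one-token overlap between pages does not produce duplicates.

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

object IterateUniqueIds {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build() // assumed contact point
    val session = cluster.connect("my_ks")                               // assumed keyspace

    val pageSize   = 1000
    var lowerBound = Long.MinValue    // Murmur3 tokens start at -2^63
    // Ids seen at the previous page's boundary token: the next query
    // starts at that same token (>=), so only those rows can repeat.
    var boundaryIds = Set.empty[String]
    var done = false

    while (!done) {
      val rows = session.execute(
        s"SELECT DISTINCT id, token(id) FROM my_table " +
        s"WHERE token(id) >= $lowerBound LIMIT $pageSize").all().asScala

      rows.filterNot(r => boundaryIds(r.getString("id")))
          .foreach(r => println(r.getString("id"))) // one distinct partition key per line

      if (rows.size < pageSize) done = true
      else {
        lowerBound  = rows.last.getLong(1) // feed the last token back in
        boundaryIds = rows.filter(_.getLong(1) == lowerBound)
                          .map(_.getString("id")).toSet
      }
    }
    cluster.close()
  }
}

To parallelize as Eric suggests, run one such loop per ring segment, adding "AND token(id) < segmentUpperBound" to the query and seeding lowerBound with each segment's start.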