Re: [DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

2025-06-03 Thread Boumalhab, Chris
This looks good to me! I’m considering tuple too if we have theta. Theta can be the priority, but given that tuple is just an extension, it doesn’t hurt to add it down the line.

Re: [DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

2025-06-03 Thread Boumalhab, Chris
Hi Ryan, Thanks for the reply! Would you recommend I put in a JIRA ticket and consider developing this? I’m not familiar with the process. Chris

Re: [DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

2025-06-03 Thread Menelaos Karavelas
Following what Ryan did for HLL sketches, I would also add an aggregate expression for unions as the aggregate version of the binary union expression. The expressions that Ryan added are: hll_sketch_agg, hll_union, hll_union_agg, hll_sketch_estimate. Following the same naming convention I would prob
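For context, the existing HLL functions can be exercised like this. A minimal sketch in Scala, assuming Spark 3.5 or later; the table and column names are illustrative, not from the thread.

```scala
// Minimal demo of the hll_* functions Ryan added (available since Spark 3.5).
// Table and column names are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hll-demo").master("local[*]").getOrCreate()
import spark.implicits._

Seq(("d1", "a"), ("d1", "b"), ("d2", "b"), ("d2", "c"))
  .toDF("day", "user_id")
  .createOrReplaceTempView("events")

// Build one sketch per day, merge the sketches, then estimate the distinct count.
spark.sql("""
  SELECT hll_sketch_estimate(hll_union_agg(sketch)) AS approx_distinct_users
  FROM (SELECT day, hll_sketch_agg(user_id) AS sketch FROM events GROUP BY day)
""").show()
```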

Re: [DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

2025-06-03 Thread Boumalhab, Chris
I think something like this could work: theta_sketch_agg(col) to build the sketch; theta_sketch_union(sketch1, sketch2) to union the sketches; theta_sketch_estimate(sketch) or theta_sketch_estimate_count(sketch) to estimate the count … Something similar can be done for tuple support. Let me know what
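To visualize the proposal, here is how those names might read in use. This is purely hypothetical: none of these functions exist in Spark today, and it reuses the illustrative events view from the HLL example above.

```scala
// Hypothetical usage of the PROPOSED theta_* functions; these do not exist
// in Spark yet. Reuses the illustrative `events` view from the HLL demo.
val daily = spark.sql(
  "SELECT day, theta_sketch_agg(user_id) AS sk FROM events GROUP BY day")
daily.createOrReplaceTempView("daily_sketches")

// Estimate distinct users across two specific days via a sketch union.
spark.sql("""
  SELECT theta_sketch_estimate(theta_sketch_union(a.sk, b.sk)) AS approx_union
  FROM daily_sketches a, daily_sketches b
  WHERE a.day = 'd1' AND b.day = 'd2'
""").show()
```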

Re: [DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

2025-06-03 Thread Menelaos Karavelas
Yes, HLL sketches do not support the operations you mention, and this is actually a good reason to add other types of sketches. Ryan beat me to answering :) DataSketches is already a dependency, so it should make some things easier. Regarding the user-facing functionality, could you please be m

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Sean Owen
Yes, you're just saying that if your app depends on Foo, and Spark depends on Foo, then ideally you depend on the exact same version Spark uses. Otherwise it's up to Maven/SBT to pick one or the other version, which might or might not be suitable. Yes, dependency conflicts are painful to deal with
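To make "depend on the exact same version Spark uses" concrete, an SBT build can pin a shared transitive dependency to Spark's version. A hedged sketch; the version numbers are placeholders, so check the POM of your actual Spark release.

```scala
// Illustrative build.sbt fragment: align a shared library (Jackson here)
// with the version the Spark runtime ships. Versions are placeholders;
// look up the real ones in the Spark POM for your release.
val sparkVersion   = "3.5.1"
val jacksonVersion = "2.15.2" // must match what this Spark release uses

libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % Provided

// Steer Maven/SBT conflict resolution toward Spark's version.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % jacksonVersion
```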

Re: [DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

2025-06-03 Thread Boumalhab, Chris
Hi Menelaos, Thanks for pointing that out. HLL sketches do not support set operations such as intersection or difference. Tuple sketches would also allow value aggregation for the same key. For those reasons, I don’t believe HLL is enough. Chris
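To make the gap concrete, here is what a theta-sketch intersection looks like against the datasketches-java library directly, a minimal sketch assuming the 3.x API; HLL sketches have no equivalent operation.

```scala
// Theta-sketch intersection via datasketches-java (assuming the 3.x API).
// HLL sketches cannot express this, which is the motivation above.
import org.apache.datasketches.theta.{SetOperation, UpdateSketch}

val a = UpdateSketch.builder().build()
val b = UpdateSketch.builder().build()
(1 to 1000).foreach(i => a.update(s"user-$i"))    // users 1..1000
(500 to 1500).foreach(i => b.update(s"user-$i"))  // users 500..1500

val intersection = SetOperation.builder().buildIntersection()
intersection.intersect(a)
intersection.intersect(b)

// The overlap is users 500..1000, so the estimate should be near 501.
println(intersection.getResult.getEstimate)
```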

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Nimrod Ofek
You don't add dependencies you don't use, but you do need to declare dependencies you do use, and if the platform you are running on uses a specific version, you need to use that version; you can't break compatibility. Since Spark uses a lot of dependencies, I don't expect the user to check if Spark us

Re: [DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

2025-06-03 Thread Menelaos Karavelas
Hello Chris. HLL sketches from the same project (Apache DataSketches) have already been integrated into Spark. How does your proposal fit given what I just mentioned? - Menelaos

Re: [DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

2025-06-03 Thread Ryan Berti
Hi Chris, We integrated DataSketches into Spark when we introduced the hll_sketch_* UDFs - see the PR from 2023 for more info. I'm sure there'd be interest in exposing other types of sketches, and I bet there'd be some potential for code-reuse between t

[DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

2025-06-03 Thread Boumalhab, Chris
Hi all, I’d like to start a discussion about adding support for [Apache DataSketches](https://datasketches.apache.org/) — specifically, Theta and Tuple Sketches — to Spark SQL and DataFrame APIs. ## Motivation These sketches allow scalable approximate set operations (like distinct count, union

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Nimrod Ofek
I'll give an example: If I have a project that reads Avro messages from a Kafka topic and writes them to Delta tables, I would expect to set only: libraryDependencies ++= Seq( "io.delta" %% "delta-spark" % deltaVersion % Provided, "org.apache.spark" %% "spark-avro" % sparkVersion, "org.apac
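For readability, here is that fragment reconstructed as a complete build.sbt excerpt. The versions are placeholders, and the final entry is truncated in the archive, so the spark-sql line is an assumption.

```scala
// Reconstruction of the build.sbt fragment above. Versions are placeholders;
// the last dependency is truncated in the archive and assumed to be spark-sql.
val sparkVersion = "3.5.1" // placeholder
val deltaVersion = "3.2.0" // placeholder

libraryDependencies ++= Seq(
  "io.delta"         %% "delta-spark" % deltaVersion % Provided,
  "org.apache.spark" %% "spark-avro"  % sparkVersion,
  "org.apache.spark" %% "spark-sql"   % sparkVersion % Provided // assumed
)
```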

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Sean Owen
Do you have an example of what you mean? Yes, a deployment of Spark has all the modules. You do not need to (and in fact should not) deploy Spark code with your Spark app for this reason. You still need to express dependencies on the Spark code that your app uses at *compile* time, however, in order to

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Nimrod Ofek
It does not compile if I don't add spark-sql. In usual projects I'd agree with you, but since Spark comes complete with all dependencies, unlike other programs where you deploy only certain dependencies, I see no reason for users to select specific dependencies that are already bundled in the Spark

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Sean Owen
For sure, but that is what Maven/SBT does. It resolves your project dependencies, looking at all their transitive dependencies, according to some rules. You do not need to re-declare Spark's dependencies in your project, no. I'm not quite sure what you mean.

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Sean Owen
I think Spark, like any project, is large enough to decompose into modules, and it has been. A single app almost surely doesn't need all the modules. So yes you have to depend on the modules you actually need, and I think that's normal. See Jackson for example. (spark-sql is not necessary as it's r

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Nimrod Ofek
Thanks Sean. There are other dependencies that you need to align with Spark if you want to use them as well, like Guava, Jackson, etc. I find them more difficult to use, because you need to go to the Spark repo to check the correct version used, and if there are upgrades between versions you need to

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Sean Owen
I think this is already how it works. Most apps would depend on just spark-sql (which depends on spark-core, IIRC). Maybe some optionally pull in streaming or mllib. I don't think it's intended that you pull in all submodules for any one app, although you could. I don't know if there's some common

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Nimrod Ofek
Hi all, Sorry for bumping this again, just trying to understand if it's worth adding a small feature for this. I think it can help Spark users and Spark libraries upgrade and support Spark versions a lot more easily :) If instead of adding many provided dependencies we'll have one that will include t
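To make the idea concrete, the single dependency being suggested might look like this in a build. The artifact name is purely hypothetical; nothing like it exists in Spark today.

```scala
// Purely hypothetical umbrella artifact capturing the proposal: one Provided
// dependency that brings in the Spark modules and their aligned versions.
val sparkVersion = "4.0.0" // placeholder
libraryDependencies +=
  "org.apache.spark" %% "spark-app-platform" % sparkVersion % Provided
```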