Re: [VOTE] SPIP: JDBC Driver for Spark Connect

2025-09-22 Thread Nimrod Ofek
I'll raise an issue with this: I don't think a user who connects to Spark over JDBC should need to know whether they are working with Spark Connect or regular Spark. The JDBC driver should know how to work with Connect, maybe with a fallback, but the user doesn't care whether they are getting Spark Connect or not... Regards, N

Re: Why Snappy Compression?

2025-08-26 Thread Nimrod Ofek
Hi, From my experience, and from all the benchmarks I ran and read, Snappy produces much larger files than ZSTD, while CPU usage is similar for both; in most cases the difference is not really noticeable. We switched to ZSTD and our CPU usage did not increase in a noticeable manner (maybe an increase

Re: Question Regarding Spark Dependencies in Scala

2025-06-04 Thread Nimrod Ofek
almost what spark-parent > does but I don't think it works that way). It feels inaccurate, and not > helpful for most use cases, but I don't see a major problem with it > actually. Your dependency graph gets a lot bigger with stuff you don't > need, but it'

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Nimrod Ofek
o not directly use, no. > This is like any other multi-module project in the Maven/SBT ecosystem. > > On Tue, Jun 3, 2025 at 1:59 PM Nimrod Ofek wrote: > >> It does not compile if I don't add spark -sql. >> In usual projects I'd agree with you, but since Spar

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Nimrod Ofek
They just need to configure spark-provided :) Thanks, Nimrod On Tue, Jun 3, 2025 at 8:57 PM Sean Owen wrote: > For sure, but, that is what Maven/SBT do. It resolves your project > dependencies, looking at all their transitive dependencies, according to > some rules. > You do not need to r

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Nimrod Ofek
on - as >> long as it's already provided... They just need to configure spark-provided >> :) >> >> Thanks, >> Nimrod >> >> >> On Tue, Jun 3, 2025 at 8:57 PM Sean Owen wrote: >> >>> For sure, but, that is what Maven/SBT do. It resolv

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Nimrod Ofek
> I don't think it's intended that you pull in all submodules for any one > app, although you could. > I don't know if there's some common subset that is both large and commonly > used. > > Maven/SBT already pull in all transitive dependencies. > > On Tue

Re: Question Regarding Spark Dependencies in Scala

2025-06-03 Thread Nimrod Ofek
ll include them all - that's a lot easier to maintain... Thanks! Nimrod On Sun, Jun 1, 2025 at 12:23 AM Nimrod Ofek wrote: > No > K8s deployment, nothing special. > I just don't see why when I'm developing and compiling or let's say > upgrade from spark

Re: Question Regarding Spark Dependencies in Scala

2025-05-31 Thread Nimrod Ofek
lysis | GDPR > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > > > > On Sat, 31 May 2025 at 19:47, Nimrod Ofek wrote: > >> Hi everyone, >> >> Apologies if this is a basic question—I’ve searched ar

Question Regarding Spark Dependencies in Scala

2025-05-31 Thread Nimrod Ofek
Hi everyone, Apologies if this is a basic question—I’ve searched around but haven’t found a clear answer. I'm currently developing a Spark application using Scala, and I’m looking for a way to include all the JARs typically bundled in a standard Spark installation as a single provided dependency.

Re: [DISCUSS] Spark - How to improve our release processes

2025-02-06 Thread Nimrod Ofek
> versions with the CI, as the PySpark code or doc generation code may not be > compatible with the old versions after 6 months. > > It would be better if we automate this process, but I don't have a good > idea now. > > On Tue, Feb 4, 2025 at 6:32 PM Nimrod Ofek wrote: >

Re: [DISCUSS] Spark - How to improve our release processes

2025-02-04 Thread Nimrod Ofek
cker images, and Github >>> cache for workflow cache. But we can use Github artifacts to store any kind >>> of package (even Docker images in the ghcr), which is fully accepted by >>> Apache policies. Also if the project has a cloud account (AWS, GCP, Azure, >>> ..

Re: [Connect] Spark connect documentation clarification request

2025-02-03 Thread Nimrod Ofek
e: > Hi Nimrod, > > We are working on this as we speak. > > There is already a PR out for the extensions use case: > https://github.com/apache/spark/pull/49604 > > Kind regards, > Herman > > On Mon, Feb 3, 2025 at 10:10 AM Nimrod Ofek wrote: > >> Hi, >> >

[Connect] Spark connect documentation clarification request

2025-02-03 Thread Nimrod Ofek
Hi, In https://spark.apache.org/spark-connect/ - at the bottom it says: Check out the guide on migrating from Spark JVM to Spark Connect to learn more about how to write code that works with Spark Connect. Also, check out how to build Spark Connect custom extensions to learn how to use specialize

Re: Spark 4.0 timeline

2025-01-07 Thread Nimrod Ofek
Hi, Yes, of course there are usually a few RCs before the final release, thanks for the clarification. I just wanted to make sure that this timeline is valid and that the awaited 4.0 version can be expected in the next 2-5 months or so, so we can plan internally for the 4.0 adoption... Thanks! On Tue, Jan 7

Spark 4.0 timeline

2025-01-07 Thread Nimrod Ofek
Hi all, Is the timeline here - https://issues.apache.org/jira/browse/SPARK-44111 - talking about a code freeze for Spark 4.0 in about a week, and a release in about a month and a half or so? Thanks, Nimrod

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-11-24 Thread Nimrod Ofek
Hi Bjørn, In Spark it is called dynamic resource allocation: https://spark.apache.org/docs/3.5.1/configuration.html#dynamic-allocation You can't just use the k8s autoscaler, since the driver needs to know about the new executors on the new worker nodes so that they get work from it. Thanks, N
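As a rough illustration of the settings behind that configuration page, the sketch below renders a few dynamic-allocation properties as `spark-submit` `--conf` arguments. The property names come from the Spark configuration documentation; the executor bounds, the mapping name `DYNAMIC_ALLOCATION_CONF` and the helper `to_submit_args` are assumptions made up for this example, not recommendations. Shuffle tracking is included on the assumption that no external shuffle service is available, which is typically the case on K8s.

```python
# Illustrative values only: the executor bounds are assumptions, not recommendations.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    # Assumed to be needed when there is no external shuffle service (e.g. on K8s).
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
}


def to_submit_args(conf):
    """Render a config mapping as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Prints the flags on one line, ready to paste after `spark-submit`.
    print(" ".join(to_submit_args(DYNAMIC_ALLOCATION_CONF)))
```

With these flags the driver itself requests and releases executors, which is the point made above: the scaling decision has to go through the driver rather than through a cluster-level autoscaler alone.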

Re: Spark Docker image with added packages

2024-10-17 Thread Nimrod Ofek
ite build tool, declare a dependency on your required >> packages. >> 2. Write your Dockerfile, with or without the Spark binaries inside it. >> 3. Using your build tool to copy the dependencies to a location that the >> Docker daemon can access. >> 4. Copy the dependenc

Re: [VOTE] Officialy Deprecate GraphX in Spark 4

2024-10-04 Thread Nimrod Ofek
Hi, Has anyone searched GitLab/GitHub and other search engines for the GraphX API to see whether it is actually searched for and used, or are we assuming it is unused because the API hasn't changed? Thanks! Nimrod On Mon, Sep 30, 2024 at 9:02 PM Holden Karau wrote: > I think it has

Re: [DISCUSS] Communicating over Slack instead of e-mails

2024-08-14 Thread Nimrod Ofek
ouldn't push > community members to stay in Slack (or similar) to keep updated for the > discussions, votes, etc. Checking inbox is a lot less overhead and a lot > less stressful. Github issue might be giving similar UX, but not the > services intended for instant chat. > > On

Re: [DISCUSS] Communicating over Slack instead of e-mails

2024-08-14 Thread Nimrod Ofek
SPIP, core features, > breaking changes, etc) and their VOTE threads, must be guarded by infinite > retention and be exposed to public (and easier to find in anytime, even for > future community member). > > > On Mon, Aug 12, 2024 at 7:46 PM Nimrod Ofek wrote: > >> Hi all, >

Re: [DISCUSS] Communicating over Slack instead of e-mails

2024-08-12 Thread Nimrod Ofek
> The Python community successfully migrated <https://discuss.python.org> from > mailing lists to a modern forum <https://discourse.org>. If we are > considering migrating off these lists (assuming that’s even possible), then > that’s what I would suggest. > > >

[DISCUSS] Communicating over Slack instead of e-mails

2024-08-12 Thread Nimrod Ofek
Hi all, Many other oss projects (some of which include some of the participants of this mailing list I'm sure) are using Slack as a more modern communication channel. I find Slack to be more appropriate these days, easier to navigate through groups, easier to see context of different threads and

Spark - range join

2024-07-17 Thread Nimrod Ofek
Hi all, Is there an open source equivalent of the range join that Databricks has? Thanks! Nimrod

Re: [DISCUSS] Auto scaling support for structured streaming

2024-07-12 Thread Nimrod Ofek
Hi, Anyone? Scaling for different loads in a structured streaming app should be a trivial requirement for users... Thanks! Nimrod On Tue, Jul 9, 2024 at 10:20, Nimrod Ofek wrote: > PMC members, can someone please push this thing forward? > > Thanks! > Nimrod > > On

Re: [DISCUSS] Auto scaling support for structured streaming

2024-07-09 Thread Nimrod Ofek
> > Cheers, > > Pavan > > > > > On Mon, Jul 8, 2024 at 10:33 AM Nimrod Ofek wrote: > >> Hi, >> >> Thanks Pavan. >> >> I think that the change is very important due to the amount of Spark >> structured streaming apps running today ou

Re: [DISCUSS] Auto scaling support for structured streaming

2024-07-08 Thread Nimrod Ofek
ed by PMC members, so not sure about > the timeline at this point. > > Thanks, > > Pavan > > > > On Thu, Jul 4, 2024 at 10:46 AM Nimrod Ofek wrote: > >> Hi, >> >> I remember there was a discussion about better supporting auto scaling >> for structured st

[DISCUSS] Auto scaling support for structured streaming

2024-07-04 Thread Nimrod Ofek
Hi, I remember there was a discussion about better supporting auto scaling for structured streaming. Is there anything happening with that for the upcoming Spark 4.0 release? Will there be support for auto scaling (at least on K8s) spark structured streaming apps? Thanks, Nimrod

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-10 Thread Nimrod Ofek
>>> wrote: >>>> >>>>> I agree with the idea of a versionless programming guide. But one >>>>> thing we need to make sure of is we give clear messages for things that >>>>> are >>>>> only available in a new version. My proposal i

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-04 Thread Nimrod Ofek
nked would then require + 1 PRs, > opposed to 1 PR in the versionless programming guide world. > > Neil > > On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek wrote: > >> Hi, >> >> While I think that the documentation needs a lot of improvement and >> important detail

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-04 Thread Nimrod Ofek
Hi, While I think that the documentation needs a lot of improvement and important details are missing - and detaching the documentation from the main project can help iterating faster on documentation specific tasks, I don't think we can nor should move to versionless documentation. Documentation

[DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Nimrod Ofek
Following the conversation started with the Spark 4.0.0 release, this is a thread to discuss improvements to our release processes. I'll start by raising some questions that probably need answers before the discussion can begin: 1. What is currently running in GitHub Actions? 2. Who currently h

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Nimrod Ofek
t up. > > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > > On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
.to/2MaRAG9 <https://amzn.to/2MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > > On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek wrote: > >> Hi, >> >> Sorry for the novice question, Wenchen - the release is done manually >>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi, Sorry for the novice question, Wenchen - is the release done manually from a laptop? Not using a CI/CD process on a build server? Thanks, Nimrod On Tue, May 7, 2024 at 8:50 PM Wenchen Fan wrote: > UPDATE: > > Unfortunately, it took me quite some time to set up my laptop and get it > ready

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Nimrod Ofek
Hi Erik and Wenchen, I think that a good practice with public APIs, and with internal APIs that have a big impact and a lot of usage, is usually to ease in changes by giving new parameters defaults that keep the former behaviour, in a method with the previous signature that carries a deprecation notice, and
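The practice described here can be sketched in a few lines. This is only an illustration under stated assumptions: `split_fields`, `split_fields_v1` and the `strip` parameter are hypothetical names invented for the example and are not Spark APIs. The new parameter defaults to the old behaviour, and the previous signature stays available as a thin wrapper that emits a deprecation notice.

```python
import warnings


def split_fields(line, separator=",", strip=False):
    """New API (hypothetical): `strip` is the added parameter.

    Its default, False, reproduces the result callers got before the change.
    """
    fields = line.split(separator)
    return [field.strip() for field in fields] if strip else fields


def split_fields_v1(line, separator=","):
    """Previous signature (hypothetical), kept as a deprecated wrapper."""
    warnings.warn(
        "split_fields_v1 is deprecated; call split_fields instead",
        DeprecationWarning,
        stacklevel=2,
    )
    # Delegate with the default that preserves the former behaviour.
    return split_fields(line, separator, strip=False)


if __name__ == "__main__":
    print(split_fields_v1("a, b"))           # old behaviour, with a warning
    print(split_fields("a, b", strip=True))  # opt in to the new behaviour
```

In Python the default argument alone would keep old call sites working; the separate wrapper is kept here only to make the "previous signature plus deprecation notice" step visible, which is how an overload would typically be retained in a JVM API.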

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Nimrod Ofek
+1 (non-binding) P.S. How do I become binding? Thanks, Nimrod On Tue, Apr 30, 2024 at 10:53 AM Ye Xianjin wrote: > +1 > Sent from my iPhone > > On Apr 30, 2024, at 3:23 PM, DB Tsai wrote: > > > +1 > > On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: > > > To add more color: > > Spark data

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
gdom >>> >>> >>>view my Linkedin profile >>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>> >>> >>> https://en.everybodywiki.com/Mich_Talebzadeh >>> >>> >>> >>> *Disclaim

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
with any advice, quote "one test result is worth one-thousand > expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von > Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". > > > On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek wrote: > >

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
l to note > that, as with any advice, quote "one test result is worth one-thousand > expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von > Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". > > > On Thu, 25 Apr 2024 at 14:38, Nimrod

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
th any advice, quote "one test result is worth one-thousand > expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von > Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". > > > On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek wrote: >

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
I would also appreciate some material that describes the differences between Spark native tables and Hive tables, and why each should be used... Thanks Nimrod On Thu, Apr 25, 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > I see a statement made as below and I quote > > "

Support Avro rolling version upgrades using schema manager

2024-04-13 Thread Nimrod Ofek
Hi, Currently, Avro records are supported in Spark - but with the limitation that we must specify the input and output schema versions. For writing out an avro record that is fine - but for reading avro records, that is usually a problem since there are upgrades and changes - and the current situa