Re: ASF board report draft for February 2025

2025-02-10 Thread Pavan Kotikalapudi
Hi Adam, thanks for bringing this initiative to Spark committers' attention again; I can resonate with that. It has been close to 2 years since this feature became operational for us (internally), and it has been waiting in the apache/spark codebase for some love. It has so many people (non-committers) interested ...

Re: ASF board report draft for February 2025

2025-02-10 Thread Mich Talebzadeh
Hi all, OK, this DRA has already had a thread here: "Vote on Dynamic resource allocation for structured streaming [SPARK-24815]". I recall I asked a committer to open the PR, and it was opened and then closed because of inactivity. Pavan Kotikalapudi was working on it. Happy to chip in and help where I can ...

Re: ASF board report draft for February 2025

2025-02-10 Thread Jungtaek Lim
Let's move the discussion to the other thread, as it's not relevant to the board report. tl;dr: Spark has a crazily large codebase with multiple layers. SS (Structured Streaming) sits on top of SQL, and SQL sits on top of Core. DRA is bound to Core and is mostly used with specific resource managers like YARN (maybe we had de ...
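For readers new to the thread, the feature under discussion builds on Spark's existing Core-level dynamic allocation settings. Below is a minimal sketch of those existing knobs; the config keys are real Spark settings, while how they should behave for Structured Streaming micro-batches is what SPARK-24815 sets out to address.

```python
# Minimal sketch: today's Core-level DRA knobs, set when building a session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dra-sketch")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # DRA needs a way to preserve shuffle data when executors are removed,
    # e.g. the external shuffle service on YARN:
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```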

RE: ASF board report draft for February 2025

2025-02-10 Thread Adam Hobbs
Thanks for the reply. I am not really fussed about how the situation is addressed, but I am just trying to keep the initiative alive. This isn't the first time I have tried to rescue it. The feature would deliver great cost savings and possibly greater performance for my use case. After the disappoi ...

Re: ASF board report draft for February 2025

2025-02-10 Thread Jungtaek Lim
Thanks, Adam, for your email. I started to look at these changes when they were proposed, but I am not familiar with DRA. It required non-trivial context building for me to be effective, which I could not prioritize. I asked my team members to review as well, and they were involved, but even they lacked context on ...

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Sean Owen
I don't think this makes sense, or at least it lacks motivation. You want teams to convert pandas code to PySpark syntax, only to run it on pandas? Why? Just run the pandas code in a larger job that also uses Spark if you like, or within UDFs. If you remove this assumption that people need to convert to PySpark ...
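A minimal sketch of the "run pandas within UDFs" suggestion, assuming a toy grouping and illustrative column names; `applyInPandas` is the existing PySpark API for running per-group pandas code unchanged.

```python
# Keep the pandas code as-is and run it inside Spark, rather than translating it.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-in-udf").getOrCreate()
sdf = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["key", "value"])

def existing_pandas_logic(pdf: pd.DataFrame) -> pd.DataFrame:
    # unchanged single-node pandas code, applied per group by Spark
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

result = sdf.groupBy("key").applyInPandas(
    existing_pandas_logic, schema="key string, value double"
)
result.show()
```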

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Mich Talebzadeh
I would suggest using something like the Lambda Architecture, which is flexible enough. This is a dated diagram that illustrates it better. The architecture will consist of two main components plus mapping: 1) Batch Feature Engineering with Spark. Purpose: process large datasets to generate features for ...
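A minimal sketch of the batch feature-engineering component described above; the input path, output path, and column names are hypothetical and only illustrate the general shape of the Spark batch job.

```python
# Hypothetical batch job: read raw events, aggregate per-user features, write them out.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-features").getOrCreate()

events = spark.read.parquet("/data/events")  # hypothetical source
features = (
    events.groupBy("user_id")
    .agg(
        F.count("*").alias("event_count"),
        F.avg("amount").alias("avg_amount"),
        F.max("event_ts").alias("last_seen"),
    )
)
features.write.mode("overwrite").parquet("/data/features")  # hypothetical sink
```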

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Tornike Gurgenidze
"I’d love to hear thoughts from the community on this idea, and *if there's a better approach to solving this issue" * In case you aren't aware, there are already tools out there (outside of spark) that try to accomplish something similar. Probably your best options are either ibis

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread José Müller
Hi Wenchen, "Interesting, so this is PySpark on pandas which is the reverse of Koalas." yes, exactly "If performance is the only problem, maybe we can improve local-mode Spark performance to be on par with these single-node engines. + @Hyukjin Kwon " Not sure if we could gain much performance, e

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Wenchen Fan
Interesting, so this is PySpark on pandas, which is the reverse of Koalas. If performance is the only problem, maybe we can improve local-mode Spark performance to be on par with these single-node engines. + @Hyukjin Kwon
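For context, "local-mode Spark" here means a single-JVM session such as the sketch below; the master URL is the standard local setting, and the shuffle-partitions value is just an illustrative choice for single-node use.

```python
# Everything runs in one process, using all local cores; this is the mode that
# would need to compete with single-node engines like pandas.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                            # in-process, all local cores
    .config("spark.sql.shuffle.partitions", "8")   # small value suits one node
    .appName("local-mode")
    .getOrCreate()
)
```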

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread José Müller
Hi Mitch, all you said is well understood, but I believe you are missing the point: the proposal is not to break Spark's way of processing, but to use Spark as a wrapper to process pandas, the same as `pandas_api()` but in the inverse direction. Most of the cases for serving ML models require low latency (ms) and id ...
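To make the direction concrete: `pandas_api()` and `to_spark()` below are existing PySpark APIs, while the commented-out last line is a purely hypothetical sketch of the inverse wrapper the proposal describes, not a real API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-directions").getOrCreate()
sdf = spark.range(10).withColumnRenamed("id", "x")

psdf = sdf.pandas_api()     # existing: Spark DataFrame -> pandas-on-Spark
sdf_back = psdf.to_spark()  # existing: back to a Spark DataFrame

# Hypothetical inverse the proposal sketches (not a real API):
# pdf_engine_df = spark_api(pandas_df)  # PySpark syntax, pandas execution
```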

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Mich Talebzadeh
Regardless, there are technical limitations here. For example, pandas operates in-memory on a single machine (such as the driver), making it unsuitable for large-scale datasets that exceed the memory or processing capacity of a single node. Unlike PySpark, pandas cannot distribute workloads across a cluster ...
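A small sketch of the contrast being drawn: the same pandas-style operation run on a plain pandas frame (single node, driver memory) versus a pandas-on-Spark frame via the existing `pyspark.pandas` module.

```python
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({"x": range(1_000)})   # single-node, lives in driver memory
psdf = ps.from_pandas(pdf)                # distributed pandas-on-Spark frame

print(pdf["x"].mean())    # computed locally by pandas
print(psdf["x"].mean())   # computed by Spark executors
```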

[PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread José Müller
Hi all, I'm new to the Spark community, so please let me know if this isn't the right forum for feature proposals. About me: I've spent over 10 years in data roles, from engineering to machine learning and analytics. A recurring challenge I've encountered is the disconnect between data engineering ...