Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-13 Thread Khalid Mammadov
There's also PolarSpark, which offers the PySpark API on a Polars engine: https://github.com/khalidmammadov/polarspark PS: It's also on PyPI.

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-12 Thread Sem
DuckDB provides "PySpark syntax" on top of a fast single-node engine: https://duckdb.org/docs/clients/python/spark_api.html As I remember, DuckDB is much faster than pandas on a single node, and it already provides a Spark-compatible API.

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Sean Owen
I don't think this makes sense, or at least it lacks motivation. You want teams to convert pandas code to PySpark syntax, only to run it on pandas? Why? Just run the pandas code in a larger job that also uses Spark if you like, or within UDFs. If you remove this assumption that people need to convert to Pyspa

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Mich Talebzadeh
I would suggest using something like a Lambda Architecture, which is flexible enough. This is a dated diagram that illustrates it better. The architecture will consist of two main components plus mapping: 1) Batch Feature Engineering with Spark: Purpose: Process large datasets to generate features for

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Tornike Gurgenidze
"I’d love to hear thoughts from the community on this idea, and *if there's a better approach to solving this issue*" In case you aren't aware, there are already tools out there (outside of Spark) that try to accomplish something similar. Probably your best options are either ibis

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread José Müller
Hi Wenchen, "Interesting, so this is PySpark on pandas which is the reverse of Koalas." Yes, exactly. "If performance is the only problem, maybe we can improve local-mode Spark performance to be on par with these single-node engines. + @Hyukjin Kwon" Not sure if we could gain much performance, e

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Wenchen Fan
Interesting, so this is PySpark on pandas, which is the reverse of Koalas. If performance is the only problem, maybe we can improve local-mode Spark performance to be on par with these single-node engines. + @Hyukjin Kwon

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread José Müller
Hi Mich, All you said is well understood, but I believe you are missing the point: the proposal is not to break Spark's way of processing, but to use Spark as a wrapper to process pandas, the same as `pandas_api()` but in reverse. Most ML model-serving cases require low latency (ms) and id

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Mich Talebzadeh
Regardless, there are technical limitations here. For example, pandas operates in memory on a single machine, like the driver, making it unsuitable for large-scale datasets that exceed the memory or processing capacity of a single node. Unlike PySpark, pandas cannot distribute workloads across a clust

[PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread José Müller
Hi all, I'm new to the Spark community—please let me know if this isn’t the right forum for feature proposals. *About Me:* I’ve spent over 10 years in data roles, from engineering to machine learning and analytics. A recurring challenge I've encountered is the disconnect between data engineering