Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Tornike Gurgenidze
re definitions, and optimize performance across both batch and real-time workflows.

*The Proposal:*

*Introduce an API that allows PySpark syntax while processing DataFrames using either Spark or Pandas, depending on the session context.*

*Simple, but intuitive example:*

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    def silver(bronze_df):
        return (
            bronze_df
            .withColumnRenamed("bronze_col", "silver_col")
        )

    def gold(silver_df):
        return (
            silver_df
            .withColumnRenamed("silver_col", "gold_col")
            .withColumn("gold_col", F.col("gold_col") + 1)
        )

    def features(gold_df):
        return (
            gold_df
            .withColumnRenamed("gold_col", "feature_col")
            .withColumn("feature_col", F.col("feature_col") + 1)
        )

    # With a Spark session (the normal way of using PySpark)
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    bronze_df = spark.createDataFrame(schema=("bronze_col",), data=[(1,)])
    silver_df = silver(bronze_df)
    gold_df = gold(silver_df)
    features_df = features(gold_df)
    features_df.show()

    # Proposed "Pandas Spark Session"
    spark = SparkSession.builder.as_pandas.getOrCreate()
    # This would execute the same transformations using Pandas under the hood.

This would enable teams to share the same codebase while choosing the most efficient processing engine.
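As a rough feasibility sketch of the dual-engine idea above: the PySpark DataFrame surface used in the example (`withColumnRenamed`, `withColumn`, `F.col`) is small enough to re-implement over pandas. The names below (`Col`, `col`, `PandasDF`) are hypothetical illustrations, not an existing PySpark API; a real "Pandas Spark Session" would have to cover far more of the API and expression language.

```python
import pandas as pd

class Col:
    """Stand-in for a pyspark.sql.Column: a deferred column expression
    evaluated against a pandas DataFrame."""
    def __init__(self, fn):
        self.fn = fn  # fn: pandas.DataFrame -> pandas.Series

    def __add__(self, other):
        # Only scalar addition is supported in this sketch.
        return Col(lambda pdf: self.fn(pdf) + other)

def col(name):
    """Stand-in for pyspark.sql.functions.col."""
    return Col(lambda pdf: pdf[name])

class PandasDF:
    """Wraps a pandas DataFrame behind a subset of the PySpark DataFrame API."""
    def __init__(self, pdf):
        self._pdf = pdf

    def withColumnRenamed(self, existing, new):
        return PandasDF(self._pdf.rename(columns={existing: new}))

    def withColumn(self, name, expr):
        out = self._pdf.copy()
        out[name] = expr.fn(self._pdf)
        return PandasDF(out)

# The same bronze -> silver -> gold -> features pipeline, executed on pandas:
bronze_df = PandasDF(pd.DataFrame({"bronze_col": [1]}))
silver_df = bronze_df.withColumnRenamed("bronze_col", "silver_col")
gold_df = (silver_df
           .withColumnRenamed("silver_col", "gold_col")
           .withColumn("gold_col", col("gold_col") + 1))
features_df = (gold_df
               .withColumnRenamed("gold_col", "feature_col")
               .withColumn("feature_col", col("feature_col") + 1))
print(features_df._pdf)  # single row, feature_col == 3
```

The transformation functions themselves stay engine-agnostic; only the session (or here, the wrapper and `col` import) decides which backend runs them.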
We've built and experimented with, across different data teams, a public library, *flypipe <https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate>*, that uses pandas_api() transformations and can run using Pandas <https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas>, but it still requires ML teams to manage separate pipelines for Spark dependencies.

I'd love to hear thoughts from the community on this idea, and *if there's a better approach to solving this issue*.

Thanks,
José Müller

--
Tornike Gurgenidze

Spark and Arrow Flight

2022-07-26 Thread Tornike Gurgenidze
Hi, I would like to know if the community here would be interested in a project I started a little while back (rough, but mostly functional prototype here - https://github.com/tokoko/SparkFlightSql). There are 2 components in the repo I would like to have your feedback about. SparkFlightSql - Arr