re definitions, and optimize performance across both batch
>>>>> and real-time workflows.
>>>>>
>>>>>
>>>>> *The Proposal:*
>>>>>
>>>>> *Introduce an API that keeps PySpark syntax while processing
>>>>> DataFrames with either Spark or Pandas, depending on the session context.*
>>>>>
>>>>>
>>>>> *A simple, intuitive example:*
>>>>>
>>>>> import pyspark.sql.functions as F
>>>>> from pyspark.sql import SparkSession
>>>>>
>>>>> def silver(bronze_df):
>>>>>     return (
>>>>>         bronze_df
>>>>>         .withColumnRenamed("bronze_col", "silver_col")
>>>>>     )
>>>>>
>>>>> def gold(silver_df):
>>>>>     return (
>>>>>         silver_df
>>>>>         .withColumnRenamed("silver_col", "gold_col")
>>>>>         .withColumn("gold_col", F.col("gold_col") + 1)
>>>>>     )
>>>>>
>>>>> def features(gold_df):
>>>>>     return (
>>>>>         gold_df
>>>>>         .withColumnRenamed("gold_col", "feature_col")
>>>>>         .withColumn("feature_col", F.col("feature_col") + 1)
>>>>>     )
>>>>>
>>>>> # With the Spark Session (normal way of using PySpark)
>>>>> spark = SparkSession.builder.master("local[1]").getOrCreate()
>>>>> bronze_df = spark.createDataFrame(schema=("bronze_col",), data=[(1,)])
>>>>> silver_df = silver(bronze_df)
>>>>> gold_df = gold(silver_df)
>>>>> features_df = features(gold_df)
>>>>> features_df.show()
>>>>>
>>>>> # Proposed "Pandas Spark Session"
>>>>> spark = SparkSession.builder.as_pandas.getOrCreate()
>>>>> # This would execute the same transformations using Pandas under the hood.
>>>>>
>>>>> This would enable teams to share the same codebase while choosing the
>>>>> most efficient processing engine.
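
To make the idea concrete, here is a rough sketch of what a Pandas-backed object with PySpark-style methods could look like. Everything in it is an assumption for illustration only: `Col` and `PandasFrame` are hypothetical names, not part of PySpark or flypipe, and only `withColumnRenamed`, `withColumn`, and `+` on a column expression are covered.

```python
import pandas as pd


class Col:
    """Hypothetical stand-in for a column expression such as F.col("x") + 1."""

    def __init__(self, name, fn=None):
        self.name = name
        # fn maps a pandas DataFrame to a Series; by default it selects the column.
        self.fn = fn or (lambda pdf: pdf[name])

    def __add__(self, other):
        # Build a new expression lazily instead of touching any data here.
        return Col(self.name, lambda pdf: self.fn(pdf) + other)


class PandasFrame:
    """Hypothetical wrapper exposing two PySpark-style methods over pandas."""

    def __init__(self, pdf):
        self._pdf = pdf

    def withColumnRenamed(self, existing, new):
        return PandasFrame(self._pdf.rename(columns={existing: new}))

    def withColumn(self, name, expr):
        # Evaluate the expression against the current frame, then add/replace.
        return PandasFrame(self._pdf.assign(**{name: expr.fn(self._pdf)}))

    def toPandas(self):
        return self._pdf.copy()


# Same shape as the bronze -> gold step above, executed purely in pandas.
bronze = PandasFrame(pd.DataFrame({"bronze_col": [1]}))
result = (
    bronze
    .withColumnRenamed("bronze_col", "gold_col")
    .withColumn("gold_col", Col("gold_col") + 1)
    .toPandas()
)
```

A real implementation would have to cover far more of the DataFrame and functions API than this, which is where most of the effort would go.
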
>>>>>
>>>>> We've built a public library, *flypipe
>>>>> <https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate>*,
>>>>> and tried it in several data teams. It lets transformations written
>>>>> with pandas_api() also run on plain Pandas
>>>>> <https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas>,
>>>>> but it still requires ML teams to manage separate pipelines for Spark
>>>>> dependencies.
>>>>>
>>>>> I’d love to hear thoughts from the community on this idea, and *if
>>>>> there's a better approach to solving this issue*.
>>>>>
>>>>> Thanks,
>>>>> José Müller
>>>>>
--
Tornike Gurgenidze
Hi,
I would like to know if the community here would be interested in a project
I started a little while back (rough, but mostly functional prototype here
- https://github.com/tokoko/SparkFlightSql). There are two components in the
repo that I would like your feedback on.
SparkFlightSql - Arr