Hey everyone! I'm re-sending this e-mail, now with a PR proposal (
https://github.com/apache/spark/pull/35045 if you want to take a look at
the code with a couple of examples). The proposed change adds only a new
class that extends the Python API, without touching the underlying Scala
code. The benefit is that the new code only extends existing functionality
without breaking any application code, so pyspark users could try it out
and see whether it turns out to be useful. Hyukjin Kwon
<https://github.com/HyukjinKwon> commented that a drawback of this approach
is that it would be hard to deprecate the `DynamicDataFrame` API later on.
The other option, if we want this kind of inheritance to be feasible, is to
implement the "casting" directly in the `DataFrame` code; for example,
`limit` would change from

def limit(self, num: int) -> "DataFrame":
    jdf = self._jdf.limit(num)
    return DataFrame(jdf, self.sql_ctx)

to

def limit(self, num: int) -> "DataFrame":
    jdf = self._jdf.limit(num)
    return self.__class__(jdf, self.sql_ctx) # type(self) would work as well
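
To make the effect concrete, here is a minimal sketch of what the change
would enable (the `SalesDataFrame` subclass, its `top_sellers` method and
the `raw_df` dataframe are hypothetical, just for illustration, and it
assumes the change is applied to every chainable method):

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

class SalesDataFrame(DataFrame):  # hypothetical subclass
    def top_sellers(self, n: int) -> "SalesDataFrame":
        return self.orderBy(F.desc("revenue")).limit(n)

# wrap an existing DataFrame in the subclass
sales = SalesDataFrame(raw_df._jdf, raw_df.sql_ctx)

# today limit() downcasts the result to DataFrame, so top_sellers() is lost;
# with self.__class__ the result stays a SalesDataFrame and chaining keeps working
top = sales.limit(1000).top_sellers(10)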

This approach would probably require similar changes on the Scala side in
order to allow this kind of inheritance in Scala as well (unfortunately,
I'm not knowledgeable enough in Scala to figure out exactly what those
changes would be).

I wanted to gather your input on this idea, whether you think it can be
helpful or not, and what would be the best strategy, in your opinion, to
pursue it.

Thank you very much!
Pablo

On Thu, Nov 4, 2021 at 9:44 PM Pablo Alcain <
pablo.alc...@wildlifestudios.com> wrote:

> tl;dr: a proposal for a pyspark "DynamicDataFrame" class that would make
> it easier to inherit from DataFrame while keeping the chainable dataframe
> methods.
>
> Hello everyone. We have been working with PySpark, and more specifically
> with DataFrames, for a long time. In our pipelines we have several tables,
> each with a specific purpose, that we usually load as DataFrames. As you
> might expect, a handful of queries and transformations per dataframe are
> run many times, so we thought of ways we could abstract them:
>
> 1. Functions: using functions that take dataframes and return them
> transformed. This had a couple of pitfalls: we had to manage the
> namespaces carefully, and the "chainability" didn't feel very pyspark-y.
> 2. Monkeypatching DataFrame: we monkeypatched (
> https://stackoverflow.com/questions/5626193/what-is-monkey-patching) the
> DataFrame class with methods for the frequently run queries. This one
> kept things pyspark-y, but there was no easy way to handle segregated
> namespaces.
> 3. Inheritance: create a class `MyBusinessDataFrame` that inherits from
> `DataFrame` and implement the methods there. This one solves all the
> issues, but with a caveat: the chainable methods cast the result
> explicitly to `DataFrame` (see
> https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910
> e.g.). Therefore, every time you use one of the parent's methods you have
> to re-cast to `MyBusinessDataFrame`, which makes the code cumbersome (see
> the sketch right after this list).
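>
> As a rough illustration of that caveat, here is a minimal sketch (the
> `MyBusinessDataFrame` class and its `with_margin` method are
> hypothetical, just for illustration):
>
> from pyspark.sql import DataFrame
> from pyspark.sql import functions as F
>
> class MyBusinessDataFrame(DataFrame):
>     def with_margin(self) -> "MyBusinessDataFrame":
>         # select() always returns a plain DataFrame...
>         df = self.select("*", (F.col("revenue") - F.col("cost")).alias("margin"))
>         # ...so every method has to re-cast by hand to keep the subclass
>         return MyBusinessDataFrame(df._jdf, df.sql_ctx)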
>
> In view of these pitfalls we went for a slightly different approach,
> inspired by #3: we created a class called `DynamicDataFrame` that
> overrides the explicit calls to `DataFrame` done in PySpark and instead
> casts dynamically to `self.__class__` (see
> https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L21
> e.g.). This allows the fluent methods to always preserve the class,
> making chaining as smooth as it is with plain pyspark dataframes.
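>
> A minimal sketch of the idea (not the exact code from the gist; the
> overridden methods and the `sql_ctx` constructor argument are
> assumptions):
>
> from pyspark.sql import DataFrame
>
> class DynamicDataFrame(DataFrame):
>     # override chainable methods so they return self.__class__ instead of DataFrame
>     def withColumn(self, colName, col) -> "DynamicDataFrame":
>         return self.__class__(super().withColumn(colName, col)._jdf, self.sql_ctx)
>
>     def select(self, *cols) -> "DynamicDataFrame":
>         return self.__class__(super().select(*cols)._jdf, self.sql_ctx)
>
> # business-specific classes inherit the dynamic casting and stay chainable
> class MyBusinessDataFrame(DynamicDataFrame):
>     def my_business_query(self) -> "MyBusinessDataFrame":
>         return self.select("id", "revenue")  # still a MyBusinessDataFrame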
>
> As an example implementation, here's a link to a gist (
> https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e)
> that dynamically implements the `withColumn` and `select` methods and
> shows the expected output.
>
> I'm sharing this here in case this approach can be useful for anyone
> else. In our case it greatly sped up the development of abstraction
> layers and allowed us to write cleaner code. One of the advantages is
> that it would simply be a "plugin" on top of pyspark that does not modify
> any existing code or application interfaces.
>
> If you think that this can be helpful, I can write a PR as a more refined
> proof of concept.
>
> Thanks!
>
> Pablo
>
