I believe explicit is better than implicit; however, as you mentioned, the
notation is very nice.

Therefore, I suggest using df.transform(myFunction), as described in
https://medium.com/@mrpowers/chaining-custom-dataframe-transformations-in-spark-a39e315f903c
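
A minimal sketch of that style, assuming curried stand-alone functions
(the function names, the "value" column, and inputDataframe are
illustrative, not taken from your code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Each transformation is a curried function: binding the parameters
// first yields the DataFrame => DataFrame that .transform() expects.
def withGreeting(greeting: String)(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit(greeting))

def filterAbove(threshold: Int)(df: DataFrame): DataFrame =
  df.filter(col("value") > threshold)

// The call sites then chain just as linearly as with implicits:
val result = inputDataframe
  .transform(filterAbove(10))
  .transform(withGreeting("hello"))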

Valery Khamenya <khame...@gmail.com> wrote on Mon, 18 Jun 2018 at 21:34:

> Dear Spark gurus,
>
> *Question*: what approach would you recommend for shaping a library of
> custom transformations for Dataframes/Datasets?
>
> *Details*: suppose we need several custom transformations over
> Dataset/Dataframe instances: for example, injecting columns, applying
> specially tuned row filtering, lookup-table based replacements, etc.
>
> I'd consider basically two options:
>
> 1) implicits: create an implicit class that wraps Dataset/Dataframe and
> implement the transformations as its methods
>
> or
>
> 2) implement the transformations as stand-alone functions
>
> The first approach leads to such beautiful code:
>
> val result = inputDataframe
>   .myAdvancedFilter(params)
>   .myAdvancedReplacement(params)
>   .myColumnInjection(params)
>   .mySomethingElseTransformation(params)
>   .andTheFinalGoodies(params)
>
> nice!
>
> whereas the second option will lead to this:
>
> val result = andTheFinalGoodies(
>   mySomethingElseTransformation(
>     myColumnInjection(
>       myAdvancedReplacement(
>         inputDataframe.myAdvancedFilter(params),
>         params),
>       params),
>     params),
>   params)
>
> terrible! ;)
>
> So, ideally, it would be nice to learn how to implement option 1. Luckily,
> there are several approaches for this:
> https://stackoverflow.com/questions/32585670/what-is-the-best-way-to-define-custom-methods-on-a-dataframe
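>
> One such approach is an implicit wrapper class, roughly like this (a
> minimal sketch; the method body and the "value" column are made up):
>
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.col
>
> object DataFrameExtensions {
>   // `import DataFrameExtensions._` brings the extension methods in scope
>   implicit class RichDataFrame(val df: DataFrame) extends AnyVal {
>     def myAdvancedFilter(threshold: Int): DataFrame =
>       df.filter(col("value") > threshold)
>   }
> }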
>
> However, in reality such transformations rely on
>
>   import spark.implicits._
>
> and I have never seen a solution showing how to pass the SparkSession to
> such library classes and safely use it there. This article shows that it
> is not entirely straightforward:
>
> https://docs.azuredatabricks.net/spark/latest/rdd-streaming/tips-for-running-streaming-apps-in-databricks.html
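>
> The kind of pattern I have in mind looks roughly like this (an untested
> sketch; the class and method names are made up):
>
> import org.apache.spark.sql.{Dataset, SparkSession}
>
> class TransformLib(val spark: SparkSession) {
>   // legal because `spark` is a stable identifier (a val)
>   import spark.implicits._
>
>   def parseInts(ds: Dataset[String]): Dataset[Int] =
>     ds.map(_.trim.toInt)  // map needs the implicit Encoder[Int]
> }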
>
> That said, I still need the wisdom of the Spark community to get past this.
>
> P.S. A good Spark application "boilerplate" with a separately implemented
> library of Dataframe/Dataset transformations relying on "import
> spark.implicits._" is still badly wanted!
>
> best regards
> --
> Valery
>
