Dear Spark gurus,

*Question*: how would you recommend structuring a library of custom
transformations for DataFrames/Datasets?

*Details*: suppose we need several custom transformations over
Dataset/DataFrame instances, for example injecting columns, applying
specially tuned row filtering, doing lookup-table based replacements, etc.

I see basically two options:

1) implicits! Create an implicit class that behaves as if derived from
Dataset/DataFrame, and implement the transformations as its methods

or

2) implement the transformations as stand-alone functions

The first approach leads to beautiful code like this:

val result = inputDataframe
  .myAdvancedFilter(params)
  .myAdvancedReplacement(params)
  .myColumnInjection(params)
  .mySomethingElseTransformation(params)
  .andTheFinalGoodies(params)

nice!

whereas the second option leads to this:

val result = andTheFinalGoodies(
  mySomethingElseTransformation(
    myColumnInjection(
      myAdvancedReplacement(
        inputDataframe.myAdvancedFilter(params),
        params),
      params),
    params),
  params)

terrible! ;)
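
(For completeness: I know the built-in Dataset.transform can flatten
Option 2 back into a chain, assuming each transformation is curried so
that, e.g., myAdvancedFilter(params) returns a DataFrame => DataFrame:

val result = inputDataframe
  .transform(myAdvancedFilter(params))
  .transform(myAdvancedReplacement(params))
  .transform(myColumnInjection(params))
  .transform(mySomethingElseTransformation(params))
  .transform(andTheFinalGoodies(params))

but it still lacks the plain method-call syntax of Option 1.)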

So, ideally, it would be nice to learn how to implement Option 1. Luckily,
there are several approaches for this described here:
https://stackoverflow.com/questions/32585670/what-is-the-best-way-to-define-custom-methods-on-a-dataframe
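
For reference, here is a minimal sketch of the implicit-class approach from
that thread (the method bodies, column names, and parameters are invented
placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

object MyTransformations {
  implicit class RichDataFrame(val df: DataFrame) extends AnyVal {
    // placeholder: apply some specially tuned row filter
    def myAdvancedFilter(minValue: Int): DataFrame =
      df.filter(df("value") >= minValue)

    // placeholder: inject a constant column
    def myColumnInjection(name: String, value: String): DataFrame =
      df.withColumn(name, lit(value))
  }
}

// usage: import MyTransformations._, then chain as in the example above

This much works without any session at hand.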

However, in reality such transformations rely on

  import spark.implicits._

and I have never seen a solution for how to pass the SparkSession (which
provides those implicits) into such library classes and use it there
safely. This article shows that it is not a straightforward thing:

https://docs.azuredatabricks.net/spark/latest/rdd-streaming/tips-for-running-streaming-apps-in-databricks.html
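
The closest I have come is pulling the session from the Dataset itself:
if I read the API right, every Dataset carries its own SparkSession, so a
library function can import the implicits from its input (a sketch, with
an invented transformation that needs the implicits for .as[String] and
.map):

import org.apache.spark.sql.{DataFrame, Dataset}

object LibTransforms {
  def namesUppercased(df: DataFrame): Dataset[String] = {
    val spark = df.sparkSession  // the session travels with the data
    import spark.implicits._     // no ambient/global session needed
    df.select("name").as[String].map(_.toUpperCase)
  }
}

But I am not sure whether this pattern is considered safe and idiomatic.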

That said, I still need the wisdom of the Spark community to get past this.

P.S. A good Spark application "boilerplate" with a separately implemented
library of DataFrame/Dataset transformations relying on "import
spark.implicits._" is still badly wanted!
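
Roughly what I have in mind (a rough sketch only; paths, the app name, and
the library methods from the sketches above are all invented):

import org.apache.spark.sql.SparkSession
import MyTransformations._  // the implicit-class library sketched above

object MyApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("transformations-demo")
      .getOrCreate()

    val result = spark.read.parquet(args(0))
      .myAdvancedFilter(10)
      .myColumnInjection("source", "demo")

    result.write.parquet(args(1))
    spark.stop()
  }
}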

best regards
--
Valery
