Thanks! I wanted to avoid repeating f1, f2, f3 in class B. I wonder whether the encoders/decoders would still work if I used mixins.
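Something like this sketch, with placeholder names and field types just for
illustration:

trait RawFields {
  def f1: String
  def f2: String
  def f3: String
}

case class A(f1: String, f2: String, f3: String) extends RawFields
case class B(f1: String, f2: String, f3: String,
             f4: Double, f5: Double, f6: Double) extends RawFields

My understanding is that Spark derives its product encoder from the case
class constructor parameters, so a trait like RawFields only helps for
sharing code that reads those fields; the parameters themselves still get
repeated in each case class. If that's right, maybe nesting is the way out,
e.g. case class B(a: A, f4: Double, f5: Double, f6: Double), since I believe
encoders handle nested case classes (at the cost of a struct column). Please
correct me if I'm wrong.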
On Tue, Jan 15, 2019 at 7:57 PM <kevin.r.mell...@gmail.com> wrote:

> Hi Mohit,
>
> I'm not sure that there is a "correct" answer here, but I tend to use
> classes whenever the input or output data represents something meaningful
> (such as a domain model object). I would recommend against creating many
> temporary classes for each and every transformation step, as that may be
> difficult to maintain over time.
>
> Using *withColumn* statements will continue to work, and you don't need
> to cast to your output class until you've set up all transformations.
> Therefore, you can do things like:
>
> case class A(f1, f2, f3)
> case class B(f1, f2, f3, f4, f5, f6)
>
> ds_a = spark.read.csv("path").as[A]
> ds_b = ds_a
>   .withColumn("f4", someUdf)
>   .withColumn("f5", someUdf)
>   .withColumn("f6", someUdf)
>   .as[B]
>
> Kevin
>
> *From:* Mohit Jaggi <mohitja...@gmail.com>
> *Sent:* Tuesday, January 15, 2019 1:31 PM
> *To:* user <user@spark.apache.org>
> *Subject:* dataset best practice question
>
> Fellow Spark Coders,
>
> I am trying to move from using DataFrames to Datasets for a reasonably
> large code base. Today the code looks like this:
>
> df_a = read_csv
> df_b = df_a.withColumn(some_transform_that_adds_more_columns)
> // repeat the above several times
>
> With Datasets, this will require defining:
>
> case class A(f1, f2, f3) // fields from the csv file
> case class B(f1, f2, f3, f4) // union of A and the new field added by
> some_transform_that_adds_more_columns
> // repeat this 10 times
>
> Is there a better way?
>
> Mohit.
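P.S. For the archives, here is a fully typed version of Kevin's pattern.
The field types, the header option, and the length-based UDF are my own
placeholders, just to make the sketch self-contained and runnable:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

case class A(f1: String, f2: String, f3: String)
case class B(f1: String, f2: String, f3: String,
             f4: Int, f5: Int, f6: Int)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Stand-in UDF; any Column expression would work here.
val someUdf = udf((s: String) => s.length)

// Assumes the CSV has a header row naming the columns f1, f2, f3.
val ds_a = spark.read
  .option("header", "true")
  .csv("path")
  .as[A]

// Stay untyped through the withColumn chain, then cast once at the end.
val ds_b = ds_a
  .withColumn("f4", someUdf(col("f1")))
  .withColumn("f5", someUdf(col("f2")))
  .withColumn("f6", someUdf(col("f3")))
  .as[B]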