Thanks! I wanted to avoid repeating f1, f2, f3 in class B. I wonder whether the encoders/decoders would still work if I used mixins.
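Something like this sketch, with placeholder names and field types just for
illustration:

trait RawFields {
  def f1: String
  def f2: String
  def f3: String
}

case class A(f1: String, f2: String, f3: String) extends RawFields
case class B(f1: String, f2: String, f3: String,
             f4: Double, f5: Double, f6: Double) extends RawFields

My understanding is that Spark derives its product encoder from the case
class constructor parameters, so a trait like RawFields only helps for
sharing code that reads those fields; the parameters themselves still get
repeated in each case class. If that's right, maybe nesting is the way out,
e.g. case class B(a: A, f4: Double, f5: Double, f6: Double), since I believe
encoders handle nested case classes (at the cost of a struct column). Please
correct me if I'm wrong.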
On Tue, Jan 15, 2019 at 7:57 PM <kevin.r.mell...@gmail.com> wrote:

> Hi Mohit,
>
> I'm not sure that there is a "correct" answer here, but I tend to use
> classes whenever the input or output data represents something meaningful
> (such as a domain model object). I would recommend against creating many
> temporary classes for each and every transformation step, as that may be
> difficult to maintain over time.
>
> Using *withColumn* statements will continue to work, and you don't need
> to cast to your output class until you've set up all transformations.
> Therefore, you can do things like:
>
> case class A(f1, f2, f3)
> case class B(f1, f2, f3, f4, f5, f6)
>
> ds_a = spark.read.csv("path").as[A]
> ds_b = ds_a
>   .withColumn("f4", someUdf)
>   .withColumn("f5", someUdf)
>   .withColumn("f6", someUdf)
>   .as[B]
>
> Kevin
>
> *From:* Mohit Jaggi <mohitja...@gmail.com>
> *Sent:* Tuesday, January 15, 2019 1:31 PM
> *To:* user <user@spark.apache.org>
> *Subject:* dataset best practice question
>
> Fellow Spark Coders,
>
> I am trying to move from using DataFrames to Datasets for a reasonably
> large code base. Today the code looks like this:
>
> df_a = read_csv
> df_b = df_a.withColumn(some_transform_that_adds_more_columns)
> // repeat the above several times
>
> With Datasets, this will require defining:
>
> case class A(f1, f2, f3) // fields from the csv file
> case class B(f1, f2, f3, f4) // union of A and the new field added by
> some_transform_that_adds_more_columns
> // repeat this 10 times
>
> Is there a better way?
>
> Mohit.
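P.S. For the archives, here is a fully typed version of Kevin's pattern.
The field types, the header option, and the length-based UDF are my own
placeholders, just to make the sketch self-contained and runnable:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

case class A(f1: String, f2: String, f3: String)
case class B(f1: String, f2: String, f3: String,
             f4: Int, f5: Int, f6: Int)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Stand-in UDF; any Column expression would work here.
val someUdf = udf((s: String) => s.length)

// Assumes the CSV has a header row naming the columns f1, f2, f3.
val ds_a = spark.read
  .option("header", "true")
  .csv("path")
  .as[A]

// Stay untyped through the withColumn chain, then cast once at the end.
val ds_b = ds_a
  .withColumn("f4", someUdf(col("f1")))
  .withColumn("f5", someUdf(col("f2")))
  .withColumn("f6", someUdf(col("f3")))
  .as[B]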