Here is a quick code sample I could come up with, using a UDF:

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._  // needed for toDF() on an RDD of case classes

case class Input(ID: String, Name: String, PhoneNumber: String, Address: String)

val df = sc.parallelize(Seq(
  Input("1", "raghav", "0123456789", "houseNo:StreetNo:City:State:Zip")
)).toDF()

// UDF that reformats the colon-separated address into a dash-separated one
val formatAddress = udf { (s: String) => s.split(":").mkString("-") }

// Add the derived column to the original dataframe
val outputDF = df.withColumn("FormattedAddress", formatAddress(df("Address")))
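If you already have your formatAddress function written, you can wrap it directly instead of inlining the logic. Roughly like this (just a sketch, assuming formatAddress is a plain, serializable Scala function; the body below is only placeholder logic):

import org.apache.spark.sql.functions.udf

// Placeholder for your real, more complex formatting logic
def formatAddress(address: String): String = address.split(":").mkString("-")

// Wrap the existing function as a UDF and add the derived column
val formatAddressUdf = udf(formatAddress _)
val outputDF = df.withColumn("FormattedAddress", formatAddressUdf(df("Address")))

This keeps the formatting logic testable on its own, with the UDF as a thin wrapper. DataFrame.map would also work, but it gives you back an RDD[Row] and you would have to rebuild the dataframe with the extra field yourself, so a UDF plus withColumn is usually the simpler route.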
-Raghav

On Thu, Oct 15, 2015 at 10:34 PM, Hao Wang <billhao.l...@gmail.com> wrote:
> Hi,
>
> I have searched around but could not find a satisfying answer to this
> question: what is the best way to do a complex transformation on a
> dataframe column?
>
> For example, I have a dataframe with the following schema and a function
> that has pretty complex logic to format addresses. I would like to use the
> function to format each address and store the output as an additional
> column in the dataframe. What is the best way to do it? Use DataFrame.map?
> Define a UDF? Some code example would be appreciated.
>
> Input dataframe:
> root
>  |-- ID: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- PhoneNumber: string (nullable = true)
>  |-- Address: string (nullable = true)
>
> Output dataframe:
> root
>  |-- ID: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- PhoneNumber: string (nullable = true)
>  |-- Address: string (nullable = true)
>  |-- FormattedAddress: string (nullable = true)
>
> The function to format addresses:
> def formatAddress(address: String): String
>
> Best regards,
> Hao Wang