Thanks for your input Soma, but I am actually looking to understand the differences, not only the performance.
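To make the comparison concrete, here is a minimal, self-contained sketch of the two variants being compared (assuming a local SparkSession; the object and app names are placeholders). Printing both physical plans makes the extra SerializeFromObject/DeserializeToObject steps in the map version visible:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

object PlanComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("plan-comparison")
      .getOrCreate()
    import spark.implicits._

    val data = Seq("A", "b", "c").toDS

    // Built-in functions: the plan stays a plain LocalTableScan plus a projection,
    // with no object round-trip.
    data.select(concat(col("value"), lit(" concat"))).explain()

    // map: each row is deserialized to a java.lang.String and serialized back,
    // which shows up as DeserializeToObject / SerializeFromObject in the plan.
    data.map(e => s"$e concat").explain()

    spark.stop()
  }
}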
---- On Sun, 05 Apr 2020 02:21:07 -0400 somplastic...@gmail.com wrote ----

If you want to measure optimisation in terms of time taken, here is an idea :)

public class MyClass {
    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();

        // Replace with your add-column code, over enough data to measure.
        Thread.sleep(5000);

        long end = System.currentTimeMillis();
        long timeTaken = end - start;
        System.out.println("Time taken " + timeTaken);
    }
}

On Sat, 4 Apr 2020, 19:07, <em...@yeikel.com> wrote:

Dear Community,

Recently, I had to solve the following problem: “for every entry of a Dataset[String], concat a constant value”. To solve it, I used built-in functions:

val data = Seq("A", "b", "c").toDS

scala> data.withColumn("valueconcat", concat(col(data.columns.head), lit(" "), lit("concat"))).select("valueconcat").explain()
== Physical Plan ==
LocalTableScan [valueconcat#161]

As an alternative, a much simpler version of the program is to use map, but it adds a serialization step that does not seem to be present in the version above:

scala> data.map(e => s"$e concat").explain
== Physical Plan ==
*(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#92]
+- *(1) MapElements <function1>, obj#91: java.lang.String
   +- *(1) DeserializeToObject value#12.toString, obj#90: java.lang.String
      +- LocalTableScan [value#12]

Is this over-optimization, or is this the right way to go?

As a follow-up, is there any better API to get the one and only column available in a Dataset[String] when using built-in functions? “col(data.columns.head)” works, but it is not ideal.

Thanks!
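For reference on the follow-up question, a short sketch of the column-addressing options (assuming a spark-shell session like the snippets above, where toDS on a Seq[String] names the single column "value" by default):

import org.apache.spark.sql.functions.col

val data = Seq("A", "b", "c").toDS

// All three select the same single column:
data.select(col(data.columns.head)) // generic: works for any single-column Dataset
data.select(col("value"))           // relies on the default column name from the String encoder
data.select(data("value"))          // Dataset.apply, same default name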