Thanks for your input, Soma, but I am actually looking to understand the 
differences between the two approaches, not only their performance. 


---- On Sun, 05 Apr 2020 02:21:07 -0400 somplastic...@gmail.com wrote ----


If you want to measure the optimisation in terms of time taken, here is an 
idea :)




public class MyClass {
    public static void main(String[] args) throws InterruptedException {
        // Record the wall-clock time before the work starts
        long start = System.currentTimeMillis();

        // Replace the sleep with your add-column code,
        // run on enough data to get a measurable duration
        Thread.sleep(5000);

        long end = System.currentTimeMillis();

        // Elapsed time in milliseconds; kept as long to avoid a narrowing cast
        long timeTaken = end - start;

        System.out.println("Time taken " + timeTaken + " ms");
    }
}
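
A possible variation on the snippet above, as a minimal sketch only: it times a task with System.nanoTime(), which is a monotonic clock meant for measuring elapsed time, and repeats the task so that a single slow first run does not dominate the result. The names TimingSketch, timeMillis and bestOfMillis are made up for this illustration, and the sleep is still just a placeholder for the real work.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper class; none of these names come from Spark or the thread.
class TimingSketch {

    // Runs the task once and returns the elapsed time in whole milliseconds.
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    // Runs the task `warmups` times without timing it, then `runs` times
    // with timing, and returns the fastest timed run in milliseconds.
    public static long bestOfMillis(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            best = Math.min(best, timeMillis(task));
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the code to be measured.
        Runnable placeholder = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println("Best of 3 runs (ms): " + bestOfMillis(placeholder, 1, 3));
    }
}
```

If the placeholder is replaced with Spark code, keep in mind that transformations are lazy, so the timed task would need to end in an action such as count() or collect() for any work to actually run.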


On Sat, 4 Apr 2020, 19:07 , <em...@yeikel.com> wrote:


Dear Community,

 

Recently, I had to solve the following problem: “for every entry of a 
Dataset[String], concat a constant value”. To solve it, I used built-in 
functions:

 

val data = Seq("A","b","c").toDS

 

scala> data.withColumn("valueconcat", concat(col(data.columns.head), lit(" "), lit("concat"))).select("valueconcat").explain()

== Physical Plan ==

LocalTableScan [valueconcat#161]

 

As an alternative, a much simpler version of the program uses map, but it 
adds a serialization step that does not seem to be present in the version 
above:

 

scala> data.map(e=> s"$e concat").explain

== Physical Plan ==

*(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#92]

+- *(1) MapElements <function1>, obj#91: java.lang.String

   +- *(1) DeserializeToObject value#12.toString, obj#90: java.lang.String

      +- LocalTableScan [value#12]

 

Is preferring the built-in functions here over-optimization, or is it the right way to go?

 

As a follow-up, is there a better API to get the one and only column 
available in a Dataset[String] when using built-in functions? 
“col(data.columns.head)” works, but it is not ideal.

 

Thanks!
