Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jörn Franke
Yes, true. I just wonder whether anybody has taken measurements across the different types of problems in the Big Data area and produced a scientific analysis of how much time is wasted on serialization/deserialization, to support the figure of 80% ;) > On 24 Jun 2016, at 10:35, Jacek Laskowski wrote

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Pranav Nakhe
Hello, The question arose because DataFrames use the Tungsten improvements together with the Catalyst optimizer. So Spark would have to do some additional work to convert an RDD to a DataFrame in order to use the optimizations available to DataFrames. Regards, Pranav On Fri, Jun 24, 20

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jacek Laskowski
Hi Jorn, You can measure the time spent on ser/deser yourself using the web UI or SparkListeners. Best regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Jun 24, 2016 at 10:14

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jörn Franke
I would push the Spark people to provide equivalent functionality. In the end, it is a deserialization/serialization process, which should not be done back and forth because it is one of the more costly aspects of processing. It needs to convert Java objects to a binary representation. It is
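
As a rough illustration of the point above, the following plain-Python sketch times repeated round trips between in-memory objects and a packed binary row format. It is not Spark code and does not reproduce Spark's internal format: the `Reading` class, the 16-byte `ROW` layout, and the helper names are all hypothetical, chosen only to show that every conversion walks and copies every record.

```python
import struct
import time
from dataclasses import dataclass


@dataclass
class Reading:
    """Hypothetical record type standing in for a JVM object."""
    sensor_id: int
    value: float


# Assumed fixed-width row layout: 8-byte signed int followed by 8-byte double.
ROW = struct.Struct("<qd")


def to_binary(records):
    """Serialize all objects into one contiguous byte buffer."""
    buf = bytearray(ROW.size * len(records))
    for i, rec in enumerate(records):
        ROW.pack_into(buf, i * ROW.size, rec.sensor_id, rec.value)
    return bytes(buf)


def from_binary(buf):
    """Deserialize the byte buffer back into objects."""
    return [Reading(sid, val) for sid, val in ROW.iter_unpack(buf)]


def timed_round_trips(records, trips):
    """Convert back and forth `trips` times and return (records, seconds)."""
    start = time.perf_counter()
    for _ in range(trips):
        records = from_binary(to_binary(records))
    return records, time.perf_counter() - start


records = [Reading(i, i * 0.5) for i in range(10_000)]
result, elapsed = timed_round_trips(records, trips=5)
print(f"5 round trips over {len(records)} records took {elapsed:.4f}s")
```

The elapsed time grows with both the number of records and the number of round trips, which is the reason for avoiding repeated conversions; actual figures for Spark have to be measured on the real workload.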

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Mich Talebzadeh
Hi, I do not claim at all that this reply is on the level of the advanced people :) However, in general a DataFrame adds a two-dimensional (table) structure to an RDD, which is basically a construct that cannot be optimised because an RDD has no schema. Now converting RDD to DF

Re: Cost of converting RDD's to dataframe and back

2016-06-23 Thread Jacek Laskowski
Hi, I've been asking myself a similar question! Thanks for sending it to the mailing list! Going from an RDD to a Dataset triggers a job to calculate a schema (unless the RDD is RDD[Row]). I *think* that transitioning from a Dataset to an RDD is almost a no-op since a Dataset requires more to
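
The schema point above can be pictured with a toy model in plain Python. This is not Spark code and makes no claim about how Spark implements the conversion: `CountingSource`, `infer_schema`, and `to_rows` are hypothetical names, and the sketch only shows why deriving a schema from the data costs an extra pass that a supplied schema avoids.

```python
class CountingSource:
    """Iterable that counts how many full passes are made over its data."""

    def __init__(self, records):
        self._records = records
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter(self._records)


def infer_schema(source):
    """Scan every record to derive (field name, type name) pairs."""
    schema = {}
    for rec in source:
        for name, value in rec.items():
            schema.setdefault(name, type(value).__name__)
    return tuple(sorted(schema.items()))


def to_rows(source, schema=None):
    """Convert dict records to tuples, inferring the schema only if needed."""
    if schema is None:
        schema = infer_schema(source)  # the extra pass over the data
    names = [name for name, _ in schema]
    rows = [tuple(rec.get(n) for n in names) for rec in source]
    return schema, rows


data = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

inferred = CountingSource(data)
schema, rows = to_rows(inferred)

explicit = CountingSource(data)
_, rows_again = to_rows(explicit, schema=schema)

print(inferred.passes, explicit.passes)
```

In this model the conversion without a schema reads the data twice, while the conversion with a supplied schema reads it once.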

Cost of converting RDD's to dataframe and back

2016-06-23 Thread pan
Hello, I am trying to understand the cost of converting an RDD to a DataFrame and back. Would converting back and forth very frequently cost performance? I do observe that some operations like join are implemented very differently for RDD (pair) and DataFrame, so I am trying to figure out the cost of