Re: Performance of loading parquet files into case classes in Spark

2016-08-30 Thread Steve Loughran
On 29 Aug 2016, at 20:58, Julien Dumazert <julien.dumaz...@gmail.com> wrote:

Hi Maciek,

I followed your recommendation and benchmarked DataFrames aggregations on Dataset. Here is what I got:

    // Dataset with RDD-style code
    // 34.223s
    df.as[A].map(_.fieldToSum).reduce(_ + _)

    // Dataset with map and DataFrames sum
    // 35.372s
    df.as[A].map(_.fieldToSum).agg(sum("value")).collect()

Re: Performance of loading parquet files into case classes in Spark

2016-08-29 Thread Julien Dumazert
Hi Maciek,

I followed your recommendation and benchmarked DataFrames aggregations on Dataset. Here is what I got:

    // Dataset with RDD-style code
    // 34.223s
    df.as[A].map(_.fieldToSum).reduce(_ + _)

    // Dataset with map and DataFrames sum
    // 35.372s
    df.as[A].map(_.fieldToSum).agg(sum("value")).collect()
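For context, a minimal, self-contained sketch of how such a timing comparison could be set up. The case class A, the field fieldToSum, and the parquet path are hypothetical stand-ins for the thread's data; the timing helper is an illustration, not the benchmark Julien actually ran:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    // Hypothetical schema standing in for the thread's case class.
    case class A(fieldToSum: Long)

    object SumBenchmark {
      // Crude wall-clock timer; prints elapsed seconds per variant.
      def time[T](label: String)(body: => T): T = {
        val start = System.nanoTime()
        val result = body
        println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3fs")
        result
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sum-benchmark").getOrCreate()
        import spark.implicits._

        val df = spark.read.parquet("/path/to/data.parquet") // placeholder path

        // Variant 1: Dataset with RDD-style code.
        time("reduce")(df.as[A].map(_.fieldToSum).reduce(_ + _))

        // Variant 2: map on the Dataset, then a DataFrame aggregation.
        // The single column produced by the map is named "value" by default.
        time("agg(sum)")(df.as[A].map(_.fieldToSum).agg(sum("value")).collect())

        spark.stop()
      }
    }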

Re: Performance of loading parquet files into case classes in Spark

2016-08-28 Thread Maciej Bryński
Hi Julien,

I thought about something like this:

    import org.apache.spark.sql.functions.sum
    df.as[A].map(_.fieldToSum).agg(sum("value")).collect()

to try using DataFrames aggregation on a Dataset instead of reduce.

Regards,
Maciek

2016-08-28 21:27 GMT+02:00 Julien Dumazert: > Hi Maciek, > > I've
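A note on why sum("value") works here: when a Dataset is mapped to a single primitive column, Spark names that column "value" by default. A quick way to confirm, in a spark-shell style session where the hypothetical df and case class A from the sketch above are in scope:

    // After mapping to a primitive type, the Dataset has a single
    // column that Spark names "value" by default:
    val mapped = df.as[A].map(_.fieldToSum)   // Dataset[Long]
    mapped.printSchema()
    // prints something like:
    // root
    //  |-- value: long (nullable = false)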

Re: Performance of loading parquet files into case classes in Spark

2016-08-28 Thread Julien Dumazert
Hi Maciek,

I've tested several variants for summing "fieldToSum":

First, RDD-style code:

    df.as[A].map(_.fieldToSum).reduce(_ + _)
    df.as[A].rdd.map(_.fieldToSum).sum()
    df.as[A].map(_.fieldToSum).rdd.sum()

All around 30 seconds. "reduce" and "sum" seem to have the same performance, for this use case.
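For readers following along, a sketch of where each variant crosses the Dataset/RDD boundary, again assuming the hypothetical A and df from the earlier sketch are in scope. On the RDD side, sum() comes from Spark's implicit conversion for numeric RDDs and returns a Double:

    import spark.implicits._

    val ds = df.as[A]                                 // typed Dataset[A]

    // 1) Stays in the Dataset API end to end.
    val r1: Long = ds.map(_.fieldToSum).reduce(_ + _)

    // 2) Drops to the RDD first, then maps and sums plain JVM objects.
    val r2: Double = ds.rdd.map(_.fieldToSum).sum()

    // 3) Maps in the Dataset API, then drops to the RDD to sum.
    val r3: Double = ds.map(_.fieldToSum).rdd.sum()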

Re: Performance of loading parquet files into case classes in Spark

2016-08-27 Thread Maciej Bryński
2016-08-27 15:27 GMT+02:00 Julien Dumazert:

> df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)

I think reduce and sum have very different performance. Did you try sql.functions.sum? Or if you want to benchmark access to the Row object, then the count() function will be a better idea.

Regards,
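A sketch of the two alternatives Maciek is pointing at, assuming the same hypothetical df as in the sketches above. The first keeps the aggregation entirely in the DataFrame API, so no per-row user lambda runs; the second forces a full scan through Row accessors without the cost of the reduce itself:

    import org.apache.spark.sql.functions.sum
    import spark.implicits._

    // Aggregate inside the DataFrame API: Catalyst can evaluate the
    // sum on its internal row format, with no user code per row.
    val total = df.agg(sum("fieldToSum")).collect()

    // Benchmark Row access alone: the map still touches every Row
    // via getAs, but count() replaces the reduce.
    val n = df.map(row => row.getAs[Long]("fieldToSum")).count()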