On 29 Aug 2016, at 20:58, Julien Dumazert <julien.dumaz...@gmail.com> wrote:
Hi Maciek,
I followed your recommendation and benchmarked DataFrame aggregations on
Dataset. Here is what I got:
// Dataset with RDD-style code
// 34.223s
df.as[A].map(_.fieldToSum).reduce(_ + _)
// Dataset with map and DataFrame sum
// 35.372s
df.as[A].map(_.fieldToSum).agg(sum("value")).collect()
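For reference, a minimal self-contained setup under which the two snippets above would run; the case class fields, the synthetic data, and the sizes are assumptions for illustration, not the thread's actual benchmark data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical schema: the thread only tells us A has a numeric field fieldToSum.
case class A(fieldToSum: Long, payload: String)

val spark = SparkSession.builder().appName("sum-benchmark").getOrCreate()
import spark.implicits._

// Synthetic stand-in for the benchmark DataFrame (size is an assumption).
val df = spark.range(0L, 10000000L).as[Long]
  .map(i => A(i % 100, s"row-$i"))
  .toDF()

// Dataset with RDD-style code (the 34.223s variant above)
df.as[A].map(_.fieldToSum).reduce(_ + _)

// Dataset with map and DataFrame sum (the 35.372s variant above)
df.as[A].map(_.fieldToSum).agg(sum("value")).collect()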
Hi Julien,
I thought about something like this:
import org.apache.spark.sql.functions.sum
df.as[A].map(_.fieldToSum).agg(sum("value")).collect()
to try DataFrame aggregation on a Dataset instead of reduce.
Regards,
Maciek
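As a side note on the column name in sum("value"): when a Dataset is mapped to a primitive type such as Long, Spark encodes it as a single column named "value". A small sketch, reusing the df and case class A assumed in the earlier setup:

val mapped = df.as[A].map(_.fieldToSum)   // Dataset[Long]

// The single encoded column of a primitive-typed Dataset is called "value",
// which is what sum("value") refers to.
mapped.printSchema()
// root
//  |-- value: long (nullable = false)

mapped.agg(sum("value")).collect()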
2016-08-28 21:27 GMT+02:00 Julien Dumazert:
Hi Maciek,
I've tested several variants for summing "fieldToSum":
First, RDD-style code:
df.as[A].map(_.fieldToSum).reduce(_ + _)
df.as[A].rdd.map(_.fieldToSum).sum()
df.as[A].map(_.fieldToSum).rdd.sum()
All around 30 seconds. "reduce" and "sum" seem to have the same performance,
for this use case.
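For reference, one rough way such variants could be timed; the time helper below is an assumption for illustration, not something from the thread, and naive wall-clock timing is only indicative on a warmed-up, otherwise idle cluster:

// Hypothetical helper for rough wall-clock timing of a Spark action.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}

time("reduce")     { df.as[A].map(_.fieldToSum).reduce(_ + _) }
time("rdd.sum")    { df.as[A].rdd.map(_.fieldToSum).sum() }
time("ds.rdd.sum") { df.as[A].map(_.fieldToSum).rdd.sum() }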
2016-08-27 15:27 GMT+02:00 Julien Dumazert:
> df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)
I think reduce and sum have very different performance.
Did you try sql.functions.sum?
Or if you want to benchmark access to the Row object, then the count() function
would be a better idea.
Regards,
Maciek
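For completeness, a minimal sketch of the two suggestions above, reusing the df assumed in the earlier setup: sql.functions.sum applied directly to the DataFrame column (no typed map), and count() as a full-scan baseline that never touches individual fields:

import org.apache.spark.sql.functions.sum

// Column-based aggregation straight on the DataFrame, bypassing the typed map.
df.agg(sum("fieldToSum")).collect()

// Full scan with no per-field access, usable as a baseline when measuring
// the cost of Row access in df.map(row => row.getAs[Long]("fieldToSum")).
df.count()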