Spark 2.0 and Yarn

2016-08-28 Thread Srikanth Sampath
Hi, with SPARK-11157, the big fat assembly jar build was removed. Has anyone used spark.yarn.archive, the alternative provided, to successfully deploy Spark on a YARN cluster? If so, what does the archive contain? What should the minimal set be? Any suggestions are greatly appreciated. Thanks
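Not an authoritative answer, but here is a minimal sketch of one way to set it up, assuming Spark 2.0 is unpacked under $SPARK_HOME and you have an HDFS path readable by all NodeManagers (all paths illustrative, not a confirmed minimal set):

    # Zip the jars so they sit at the top level of the archive
    # (the extracted directory itself goes on the classpath)
    cd $SPARK_HOME/jars
    zip -q spark-jars.zip *.jar

    # Upload to a location visible to the whole cluster
    hdfs dfs -mkdir -p /spark/2.0.0
    hdfs dfs -put spark-jars.zip /spark/2.0.0/

    # Point Spark at it, e.g. in conf/spark-defaults.conf:
    # spark.yarn.archive  hdfs:///spark/2.0.0/spark-jars.zip

As a starting point, the full contents of the distribution's jars/ directory should be a safe set; how far it can be trimmed below that likely depends on which Spark components your jobs actually use.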

Re: Performance of loading parquet files into case classes in Spark

2016-08-28 Thread Julien Dumazert
Hi Maciek, I've tested several variants for summing "fieldToSum". First, RDD-style code:

    df.as[A].map(_.fieldToSum).reduce(_ + _)
    df.as[A].rdd.map(_.fieldToSum).sum()
    df.as[A].map(_.fieldToSum).rdd.sum()

All run in around 30 seconds. "reduce" and "sum" seem to have the same performance for this use case.
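For anyone wanting to reproduce this, a minimal self-contained timing sketch could look like the following; the case class A, its field fieldToSum, and the Parquet path are all illustrative names from the thread, not a real schema:

    import org.apache.spark.sql.SparkSession

    case class A(fieldToSum: Long)  // hypothetical schema

    val spark = SparkSession.builder().appName("sum-bench").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/path/to/data")  // hypothetical path

    // Crude wall-clock timer; run each variant a few times to warm up the JVM
    def time[T](label: String)(f: => T): T = {
      val t0 = System.nanoTime()
      val r = f
      println(s"$label: ${(System.nanoTime() - t0) / 1e9} s")
      r
    }

    time("typed map + reduce")  { df.as[A].map(_.fieldToSum).reduce(_ + _) }
    time("rdd + map + sum")     { df.as[A].rdd.map(_.fieldToSum).sum() }
    time("typed map + rdd sum") { df.as[A].map(_.fieldToSum).rdd.sum() }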

Re: Performance of loading parquet files into case classes in Spark

2016-08-28 Thread Maciej Bryński
Hi Julien, I thought about something like this:

    import org.apache.spark.sql.functions.sum

    df.as[A].map(_.fieldToSum).agg(sum("value")).collect()

to try using DataFrame aggregation on the Dataset instead of reduce. Regards, Maciek
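Written out alongside a fully untyped variant for comparison (df, A, and fieldToSum as in the thread, all illustrative), the idea would look roughly like:

    import org.apache.spark.sql.functions.sum
    import spark.implicits._

    // Maciek's suggestion: typed map, then a Catalyst aggregation on the
    // default "value" column that the mapped Dataset exposes
    df.as[A].map(_.fieldToSum).agg(sum("value")).collect()

    // Untyped variant that skips the case-class round trip entirely
    df.agg(sum("fieldToSum")).collect()

If the cost really is in deserializing rows into case classes, the untyped variant should make that visible.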