RE: Partitioning to speed up processing?

2016-03-10 Thread Gerhard Fiedler
Grouping is applied in the aggregation.

From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of Holden Karau
Sent: Thu, Mar 10, 2016 13:56
To: Gerhard Fiedler
Cc: user@spark.apache.org
Subject: Re: Partitioning to speed up processing?

Are they entire data set aggregates or is

Partitioning to speed up processing?

2016-03-10 Thread Gerhard Fiedler
I have a number of queries that result in a sequence Filter > Project > Aggregate. I wonder whether partitioning the input table makes sense. Does Aggregate benefit from a partitioned input? If so, what partitions would be most useful (related to the aggregations)? Do Filter and Project preserv
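
To make the question concrete, here is a minimal pure-Python sketch (not Spark code) of the idea behind it: if rows are hash-partitioned on the grouping key, every group lives in exactly one partition, so per-partition aggregates need no cross-partition merge. The column names `region`, `amount`, and `note`, the sample rows, and both helper functions are hypothetical. Whether Spark's Aggregate actually takes advantage of an existing partitioning is the open question of the thread and is not settled by this sketch.

```python
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Place each row in a partition chosen by the hash of its grouping key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def filter_project_aggregate(partition):
    """Run Filter > Project > Aggregate on one partition: sum(amount) per region."""
    totals = defaultdict(int)
    for row in partition:
        if row["amount"] > 0:                              # Filter
            region, amount = row["region"], row["amount"]  # Project (drops "note")
            totals[region] += amount                       # Aggregate, grouped by region
    return dict(totals)


# Hypothetical input table.
rows = [
    {"region": "eu", "amount": 10, "note": "a"},
    {"region": "us", "amount": 5, "note": "b"},
    {"region": "eu", "amount": -3, "note": "c"},
    {"region": "us", "amount": 7, "note": "d"},
    {"region": "ap", "amount": 2, "note": "e"},
]

partitions = hash_partition(rows, "region", 3)
partials = [filter_project_aggregate(p) for p in partitions]

# Each region appears in exactly one partial result, so combining the
# partials is a plain union with no per-key merging.
result = {}
for partial in partials:
    result.update(partial)
```

Neither the filter nor the projection moves a row to another partition here, which is why the partitioning still holds when the aggregation runs.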

RE: How to add a custom jar file to the Spark driver?

2016-03-09 Thread Gerhard Fiedler
/create-cluster.html) doesn’t have a similar argument.

Gerhard

From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: Wed, Mar 09, 2016 04:28
To: Wang, Daoyuan
Cc: Gerhard Fiedler; user@spark.apache.org
Subject: Re: How to add a custom jar file to the Spark driver?

Hi Gerhard, I just stumbled upon

How to add a custom jar file to the Spark driver?

2016-03-08 Thread Gerhard Fiedler
We're running Spark 1.6.0 on EMR, in YARN client mode. We run Python code, but we want to add a custom jar file to the driver. When running on a local one-node standalone cluster, we just use spark.driver.extraClassPath and everything works: spark-submit --conf spark.driver.extraClassPath=/path
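
For reference, the general shape of the invocation described above can be sketched as a small Python helper that assembles the argument list. The jar path `/opt/jars/custom.jar`, the script name `job.py`, the master URL, and the function `build_spark_submit` are all hypothetical stand-ins, since the original command is cut off. The sketch only reproduces the form reported to work on the standalone cluster; whether the same setting takes effect on EMR in YARN client mode is what the thread is asking.

```python
import shlex


def build_spark_submit(script, driver_jar, master):
    """Assemble a spark-submit argument list that sets the driver classpath.

    All argument values are supplied by the caller; nothing here is taken
    from the original (truncated) command.
    """
    return [
        "spark-submit",
        "--master", master,
        "--conf", f"spark.driver.extraClassPath={driver_jar}",
        script,
    ]


# Hypothetical values for illustration only.
cmd = build_spark_submit(
    script="job.py",
    driver_jar="/opt/jars/custom.jar",
    master="spark://localhost:7077",
)

# Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list rather than a single string avoids quoting problems when a path contains spaces.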