SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-18 Thread Nathan McCarthy
Hey, Spark Submit adds Maven Central & Spark Bintray to the ChainResolver before it adds any external resolvers. https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821 When running on a cluster without internet access, this means the s

Spark SQL DATE_ADD function - Spark 1.3.1 & 1.4.0

2015-06-17 Thread Nathan McCarthy
Hi guys, Running with a Parquet-backed table in Hive 'dim_promo_date_curr_p' which has the following data: scala> sqlContext.sql("select * from pz.dim_promo_date_curr_p").show(3) 15/06/18 00:53:21 INFO ParseDriver: Parsing command: select * from pz.dim_promo_date_curr_p 15/06/18 00:53:21 INFO P

Re: Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-17 Thread Nathan McCarthy
ace condition bug and just filed https://issues.apache.org/jira/browse/SPARK-8406 to track this. Will deliver a fix ASAP and this will be included in 1.4.1. Best, Cheng On 6/16/15 12:30 AM, Nathan McCarthy wrote: Hi all, Looks like data frame parquet writing is very broken in Spark 1.4.0. We

Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-16 Thread Nathan McCarthy
Hi all, Looks like data frame parquet writing is very broken in Spark 1.4.0. We had no problems with Spark 1.3. When trying to save a data frame with 569610608 rows. dfc.write.format("parquet").save("/data/map_parquet_file") We get random results between runs. Caching the data frame in memor

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-16 Thread Nathan McCarthy
ou ran Spark with? On Wed, Apr 15, 2015 at 11:27 AM, Nathan McCarthy <nathan.mccar...@quantium.com.au> wrote: Just an update, tried with the old JdbcRDD and that worked fine. From: Nathan <nathan.mccar...@quantium.com.au> Date: Wednesday, 15 April 2015 1:57 pm

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Nathan McCarthy
PI when running on YARN - Spark 1.3.0 Tried with 1.3.0 release (built myself) & the most recent 1.3.1 Snapshot off the 1.3 branch. Haven't tried with 1.4/master. From: Wang, Daoyuan [daoyuan.w...@intel.com] Sent:

RE: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Nathan McCarthy
Tried with 1.3.0 release (built myself) & the most recent 1.3.1 Snapshot off the 1.3 branch. Haven't tried with 1.4/master. From: Wang, Daoyuan [daoyuan.w...@intel.com] Sent: Wednesday, April 15, 2015 5:22 PM To: Nathan McCarthy; user@spark.apache.or

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-14 Thread Nathan McCarthy
Just an update, tried with the old JdbcRDD and that worked fine. From: Nathan <nathan.mccar...@quantium.com.au> Date: Wednesday, 15 April 2015 1:57 pm To: "user@spark.apache.org" <user@spark.apache.org> Subject: SparkSQL JDBC Datasources API when runni

SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-14 Thread Nathan McCarthy
Hi guys, Trying to use a Spark SQL context’s .load("jdbc", …) method to create a DF from a JDBC data source. All seems to work well locally (master = local[*]), however as soon as we try and run on YARN we have problems. We seem to be running into problems with the classpath and loading up the

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-15 Thread Nathan McCarthy
apache.org> Subject: Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats? On 1/11/15 1:40 PM, Nathan McCarthy wrote: Thanks Cheng & Michael! Makes sense. Appreciate the tips! Idiomatic Scala isn't performant. I’ll definitely start using while loo

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-10 Thread Nathan McCarthy
ost of decompressing the column buffers vs. accessing a bunch of uncompressed primitive objects. On Fri, Jan 9, 2015 at 6:59 AM, Cheng Lian <lian.cs@gmail.com> wrote: Hey Nathan, Thanks for sharing, this is a very interesting post :) My comments are inlined below. Che

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-08 Thread Nathan McCarthy
& functional Scala! I think I’m doing something dumb… Is there something I should be doing to get faster performance on MapPartitions on SchemaRDDs? Is there some unwrapping going on in the background that catalyst does in a smart way that I’m missing? Cheers, ~N Nathan McCarthy QU

SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-06 Thread Nathan McCarthy
m doing something dumb… Is there something I should be doing to get faster performance on MapPartitions on SchemaRDDs? Is there some unwrapping going on in the background that catalyst does in a smart way that I’m missing? Cheers, ~N Nathan McCarthy QUANTIUM Level 25, 8 Chifley, 8-12