What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Tom Seddon
I have a job that is running into intermittent errors with [SparkDriver] java.lang.OutOfMemoryError: Java heap space. Before I was getting this error, I was getting errors saying the result size exceeded spark.driver.maxResultSize. This does not make any sense to me, as there are no actions in m…
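
For context: reduceByKey is a transformation, not an action, but every task it launches still ships a small status/accumulator payload back to the driver, and with a very large number of partitions those payloads can collectively trip spark.driver.maxResultSize even without a collect(). A minimal sketch of the relevant knob — the config key is a real Spark setting, but the app logic and paths are illustrative:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("maxResultSize-demo")
            # Cap on the total serialized results all tasks may send back to
            # the driver; default is 1g in Spark 1.x, and "0" disables the check.
            .set("spark.driver.maxResultSize", "2g"))
    sc = SparkContext(conf=conf)

    # Fewer, larger partitions mean fewer per-task payloads flowing back
    # to the driver.
    counts = (sc.textFile("s3n://my-bucket/logs/*")       # hypothetical path
                .map(lambda line: (line.split("\t")[0], 1))
                .reduceByKey(lambda a, b: a + b, numPartitions=200))
    counts.saveAsTextFile("s3n://my-bucket/counts/")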

Re: java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Tom Seddon
> There has to be a difference in classpaths in yarn-client and yarn-cluster
> mode. Perhaps a good starting point would be to print the classpath as the
> first thing in SimpleApp.main. It should give clues around why it works in
> yarn-cluster mode.
>
> Thanks,
> Aniket
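
A PySpark analogue of that suggestion, for anyone landing here from a Python app (sc._jvm is an internal py4j handle, so treat this as a debugging sketch rather than stable API):

    from pyspark import SparkContext

    sc = SparkContext(appName="classpath-debug")

    # Print the driver JVM's classpath one entry per line, as the first thing
    # the app does; diff the output between yarn-client and yarn-cluster runs.
    classpath = sc._jvm.java.lang.System.getProperty("java.class.path")
    for entry in classpath.split(":"):
        print(entry)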

java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Tom Seddon
Hi, I have a problem trying to get a fairly simple app working which makes use of native avro libraries. The app runs fine on my local machine and in yarn-cluster mode, but when I try to run it in yarn-client mode on EMR I get the error below. I'm aware this is a version problem, as EMR runs an ear…
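
A quick way to confirm which Avro jar actually won (a diagnostic sketch; GenericData is just a convenient real Avro class to probe, and the comment reflects the usual EMR setup rather than anything verified here):

    from pyspark import SparkContext

    sc = SparkContext(appName="avro-version-check")

    # Ask the driver JVM where it loaded the Avro classes from; in yarn-client
    # mode this often points at the cluster-provided jar rather than the one
    # bundled with the application.
    cls = sc._jvm.java.lang.Class.forName("org.apache.avro.generic.GenericData")
    source = cls.getProtectionDomain().getCodeSource()
    print(source.getLocation().toString() if source is not None
          else "bootstrap classloader")

If that confirms the clash, the usual escape hatches are shading the app's Avro dependency or the (experimental, Spark 1.3+) spark.driver.userClassPathFirst / spark.executor.userClassPathFirst settings.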

Re: SparkSQL DF.explode with Nulls

2015-06-04 Thread Tom Seddon
.take(20).foreach(println)

On Thu, Jun 4, 2015 at 12:05 PM Tom Seddon wrote:
> Hi,
>
> I've worked out how to use explode on my input avro dataset with the
> following structure
>
> root
>  |-- pageViewId: string (nullable = false)
>  |-- components: array…
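
For the nulls part of this thread, the usual workaround is to deal with null arrays before exploding. A sketch in PySpark terms (functions.explode is available from Spark 1.4; df is assumed to be the loaded avro DataFrame with the schema quoted above):

    from pyspark.sql import functions as F

    # Rows whose components array is null would otherwise be lost (or raise,
    # depending on version) when exploded, so handle them explicitly first.
    exploded = (df.filter(F.col("components").isNotNull())
                  .select("pageViewId",
                          F.explode("components").alias("component")))
    exploded.select("pageViewId", "component.name").show(20)

On Spark 2.2+ explode_outer does this natively, keeping rows whose array is null.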

SparkSQL DF.explode with Nulls

2015-06-04 Thread Tom Seddon
Hi, I've worked out how to use explode on my input avro dataset with the following structure:

root
 |-- pageViewId: string (nullable = false)
 |-- components: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |-- loadT…
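
A minimal sketch of that explode in PySpark, assuming df is the avro-backed DataFrame with the schema above (column names taken from the post):

    from pyspark.sql import functions as F

    # explode() emits one row per array element; the struct fields are then
    # reachable with dotted paths.
    components = df.select("pageViewId",
                           F.explode("components").alias("component"))
    flat = components.select("pageViewId",
                             F.col("component.name").alias("name"))
    flat.show(20)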

PySpark saveAsTextFile gzip

2015-01-15 Thread Tom Seddon
Hi, I've searched but can't seem to find a PySpark example. How do I write compressed text file output to S3 using PySpark saveAsTextFile? Thanks, Tom
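
For reference, saveAsTextFile accepts an optional Hadoop codec class name, which is the PySpark way to get gzipped part files (bucket and path here are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="gzip-output")

    rdd = sc.parallelize(["line one", "line two"])
    # GzipCodec ships with Hadoop, so no extra jars are needed.
    rdd.saveAsTextFile(
        "s3n://my-bucket/output/",   # hypothetical bucket
        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")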

Efficient way to split an input data set into different output files

2014-11-19 Thread Tom Seddon
I'm trying to set up a PySpark ETL job that takes in JSON log files and spits out fact table files for upload to Redshift. Is there an efficient way to send different event types to different outputs without having to just read the same cached RDD twice? I have my first RDD, which is just a json p…
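
One common pattern is to parse and cache once, then run one filtered pass per type. A sketch, with hypothetical event-type names and paths:

    import json
    from pyspark import SparkContext

    sc = SparkContext(appName="split-by-event-type")

    # Parse the JSON once and cache it; each per-type pass below is then a
    # cheap scan over the cached partitions rather than a re-read of the logs.
    events = (sc.textFile("s3n://my-bucket/logs/*")    # hypothetical input
                .map(json.loads)
                .cache())

    for event_type in ["page_view", "click"]:          # illustrative names
        # Bind event_type as a default argument: a bare closure in a loop
        # would capture the variable itself, not its current value.
        (events.filter(lambda e, t=event_type: e.get("type") == t)
               .map(json.dumps)
               .saveAsTextFile("s3n://my-bucket/out/" + event_type + "/"))

Each filter still scans every cached partition once per type, so this trades extra scans for avoiding re-parsing; a single pass needs a custom Hadoop OutputFormat instead.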

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-11-11 Thread Tom Seddon
Yes please, can you share? I am getting this error after expanding my application to include a large broadcast variable. It would be good to know if it can be fixed with configuration.

On 23 October 2014 18:04, Michael Campbell wrote:
> Can you list what your fix was so others can benefit?
>
> On W…
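
The mitigations commonly reported for this era of Spark were longer connection timeouts and bigger control-plane frames. A hedged sketch — the keys are real Spark 1.x settings, the values are illustrative, and neither is a confirmed fix for this thread:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("big-broadcast")
            # Wait longer for ACKs before the ConnectionManager cancels a
            # SendingConnection (Spark 1.x setting; value in seconds).
            .set("spark.core.connection.ack.wait.timeout", "600")
            # Larger Akka frames for bulky task/broadcast metadata (MB).
            .set("spark.akka.frameSize", "128"))
    sc = SparkContext(conf=conf)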

Re: Broadcast failure with variable size of ~ 500mb with "key already cancelled ?"

2014-11-11 Thread Tom Seddon
Hi, just wondering if anyone has any advice about this issue, as I am experiencing the same thing. I'm working with multiple broadcast variables in PySpark, most of which are small, but one is around 4.5GB; I'm using 10 workers with 31GB of memory each and a driver with the same spec. It's not running out of m…
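
For multi-GB broadcasts, the shape of the usual advice is to make sure torrent broadcast is in effect so executors fetch chunks from each other rather than all hitting the driver. A sketch under that assumption (torrent is already the default from Spark 1.1, so the setting only guards against an override; the lookup dict is a stand-in):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("large-broadcast")
            # Chunk the payload and let executors serve blocks to each other.
            .set("spark.broadcast.factory",
                 "org.apache.spark.broadcast.TorrentBroadcastFactory"))
    sc = SparkContext(conf=conf)

    lookup = {"example-key": 1}   # stands in for the ~4.5GB lookup table
    bc = sc.broadcast(lookup)
    hits = sc.parallelize(["example-key", "other"]).map(lambda k: k in bc.value)
    print(hits.collect())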