Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
I'm currently doing something like this in my Spark Streaming program (Java): dStream.foreachRDD((rdd, batchTime) -> { log.info("processing RDD from batch {}", batchTime); // my rdd processing code }); Instead of having my
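
A minimal sketch (not from the thread itself, and mirroring the lambda style of the snippet above) of one way to end up with a single combined RDD per batch when the data arrives on more than one DStream: union the streams up front so foreachRDD sees one RDD per interval. The hosts, ports, and batch interval below are placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf().setAppName("combine-batch-rdds");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    // Two illustrative input streams; hosts and ports are placeholders.
    JavaDStream<String> streamA = jssc.socketTextStream("hostA", 9999);
    JavaDStream<String> streamB = jssc.socketTextStream("hostB", 9999);

    // Union them so each batch interval is handed to foreachRDD as one RDD.
    JavaDStream<String> combined = streamA.union(streamB);

    combined.foreachRDD((rdd, batchTime) -> {
        // rdd holds the union of both streams' records for this batch
        System.out.println("processing combined RDD from batch " + batchTime
                + " with " + rdd.count() + " records");
    });

    jssc.start();
    jssc.awaitTermination();

If the intent is instead to process several batches' worth of data as a single RDD, dStream.window(...) achieves a similar effect, at the cost of changing the interval at which the processing code runs.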

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
I should note that the amount of data in each batch is very small, so I'm not concerned with performance implications of grouping into a single RDD. On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase wrote: > I'm currently doing something like this in my Spark Streaming pr

Extremely slow shuffle writes and large job time fluctuations

2016-07-19 Thread Jon Chase
I'm running into an issue with a pyspark job where I'm sometimes seeing extremely variable job times (20min to 2hr) and very long shuffle times (e.g. ~2 minutes for 18KB/86 records). Cluster set up is Amazon EMR 4.4.0, Spark 1.6.0, an m4.2xl driver and a single m4.10xlarge (40 vCPU, 160GB) executo

Re: About memory leak in spark 1.4.1

2015-09-28 Thread Jon Chase
I'm seeing a similar (same?) problem on Spark 1.4.1 running on Yarn (Amazon EMR, Java 8). I'm running a Spark Streaming app 24/7 and system memory eventually gets exhausted after about 3 days and the JVM process dies with: # # There is insufficient memory for the Java Runtime Environment to conti

Re: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-24 Thread Jon Chase
Shahab - This should do the trick until Hao's changes are out: sqlContext.sql("create temporary function foobar as 'com.myco.FoobarUDAF'"); sqlContext.sql("select foobar(some_column) from some_table"); This works without having to 'deploy' a JAR with the UDAF in it - just make sure the UDA
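
For anyone following along, a small sketch of the calls above inside a driver program, assuming the UDAF class com.myco.FoobarUDAF from the example is already on the classpath (the function, table, and column names are the placeholders used in the message):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.hive.HiveContext;

    JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("udaf-example"));
    HiveContext sqlContext = new HiveContext(jsc.sc());

    // Register the Hive UDAF under a temporary function name for this session only.
    sqlContext.sql("create temporary function foobar as 'com.myco.FoobarUDAF'");

    // Use it like any built-in aggregate function.
    sqlContext.sql("select foobar(some_column) from some_table").show();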

Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
Spark 1.3.0, Parquet I'm having trouble referencing partition columns in my queries. In the following example, 'probeTypeId' is a partition column. For example, the directory structure looks like this: /mydata /probeTypeId=1 ...files... /probeTypeId=2 ...files... I see
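
A minimal sketch of how such a layout is typically queried (assuming Spark 1.3's DataFrame API; the temp table name is my own placeholder), which is the pattern the error shows up in:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("partition-query"));
    SQLContext sqlContext = new SQLContext(jsc.sc());

    // Point at the root directory so the probeTypeId=... subdirectories are
    // discovered as a partition column rather than read as plain data files.
    DataFrame data = sqlContext.parquetFile("/mydata");
    data.registerTempTable("mydata");

    // probeTypeId comes from the directory names, not from the Parquet files.
    sqlContext.sql("select count(*) from mydata where probeTypeId = 1").show();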

Re: Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
I've filed this as https://issues.apache.org/jira/browse/SPARK-6554 On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase wrote: > Spark 1.3.0, Parquet > > I'm having trouble referencing partition columns in my queries. > > In the following example, 'probeTypeId' is a pa

Spark SQL queries hang forever

2015-03-26 Thread Jon Chase
Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB), executor memory 20GB, driver memory 10GB I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread out over roughly 2,000 Parquet files and my queries frequently hang. Simple queries like "select count(*) from

Spark SQL "lateral view explode" doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
Spark 1.3.0 Two issues: a) I'm unable to get a "lateral view explode" query to work on an array type b) I'm unable to save an array type to a Parquet file I keep running into this: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq Here's a stack trace from the explo
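
A rough, self-contained sketch of the kind of setup that triggers this (my own reconstruction, not code from the thread): a Java bean with a primitive int[] field becomes an array<int> column, which is then exploded and written to Parquet. Class, table, and path names are invented for illustration.

    import java.io.Serializable;
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    public class ArrayColumnRepro {
        // Bean with a primitive int[] field, mapped by Spark SQL to array<int>.
        public static class Record implements Serializable {
            private int[] nums;
            public int[] getNums() { return nums; }
            public void setNums(int[] nums) { this.nums = nums; }
        }

        public static void main(String[] args) {
            JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("array-repro"));
            HiveContext sqlContext = new HiveContext(jsc.sc());

            Record r = new Record();
            r.setNums(new int[] { 1, 2, 3 });
            DataFrame df = sqlContext.createDataFrame(jsc.parallelize(Arrays.asList(r)), Record.class);
            df.registerTempTable("records");

            // a) "lateral view explode" over the array column
            sqlContext.sql("select n from records lateral view explode(nums) t as n").show();

            // b) saving the array column to a Parquet file
            df.saveAsParquetFile("/tmp/records.parquet");
        }
    }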

Re: Spark SQL "lateral view explode" doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
mes the > underlying SQL array is represented by a Scala Seq. Would you mind to open > a JIRA ticket for this? Thanks! > > Cheng > > On 3/27/15 7:00 PM, Jon Chase wrote: > > Spark 1.3.0 > > Two issues: > > a) I'm unable to get a "lateral view explode&

Re: Spark SQL "lateral view explode" doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
that, would you mind to also provide the full stack > trace of the exception thrown in the saveAsParquetFile call? Thanks! > > Cheng > > On 3/27/15 7:35 PM, Jon Chase wrote: > > https://issues.apache.org/jira/browse/SPARK-6570 > > I also left in the call to saveAsParquetFi

Re: RDD collect hangs on large input data

2015-04-07 Thread Jon Chase
Zsolt - what version of Java are you running? On Mon, Mar 30, 2015 at 7:12 AM, Zsolt Tóth wrote: > Thanks for your answer! > I don't call .collect because I want to trigger the execution. I call it > because I need the rdd on the driver. This is not a huge RDD and it's not > larger than the one

Spark cluster with Java 8 using ./spark-ec2

2014-11-25 Thread Jon Chase
I'm trying to use the spark-ec2 command to launch a Spark cluster that runs Java 8, but so far I haven't been able to get the Spark processes to use the right JVM at start up. Here's the command I use for launching the cluster. Note I'm using the user-data feature to install Java 8: ./spark-ec2

Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-18 Thread Jon Chase
I'm running a very simple Spark application that downloads files from S3, does a bit of mapping, then uploads new files. Each file is roughly 2MB and is gzip'd. I was running the same code on Amazon's EMR w/Spark and not having any download speed issues (Amazon's EMR provides a custom implementat

Re: "Fetch Failure"

2014-12-19 Thread Jon Chase
I'm getting the same error ("ExecutorLostFailure") - input RDD is 100k small files (~2MB each). I do a simple map, then keyBy(), and then rdd.saveAsHadoopDataset(...). Depending on the memory settings given to spark-submit, the time before the first ExecutorLostFailure varies (more memory == long

Re: "Fetch Failure"

2014-12-19 Thread Jon Chase
at it > requests from yarn. It's required because jvms take up some memory beyond > their heap size. > > -Sandy > > On Dec 19, 2014, at 9:04 AM, Jon Chase wrote: > > I'm getting the same error ("ExecutorLostFailure") - input RDD is 100k > small fi
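
To make the suggestion concrete, a hedged sketch of setting the overhead when building the SparkConf; it can equally be passed as --conf on spark-submit, as is done later in this thread, and the values here are illustrative only:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // spark.yarn.executor.memoryOverhead (MB) pads the YARN container request
    // beyond the executor heap to cover off-heap and JVM overhead.
    SparkConf conf = new SparkConf()
            .setAppName("fetch-failure-tuning")
            .set("spark.executor.memory", "8g")                    // illustrative heap size
            .set("spark.yarn.executor.memoryOverhead", "4096");    // MB; value tried in this thread

    JavaSparkContext jsc = new JavaSparkContext(conf);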

Re: "Fetch Failure"

2014-12-19 Thread Jon Chase
mitted 7392K, reserved 1048576K On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase wrote: > I'm actually already running 1.1.1. > > I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no > luck. Still getting "ExecutorLostFailure (executor lost)". >

Re: "Fetch Failure"

2014-12-19 Thread Jon Chase
Yes, same problem. On Fri, Dec 19, 2014 at 11:29 AM, Sandy Ryza wrote: > Do you hit the same errors? Is it now saying your containers are exceed > ~10 GB? > > On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase wrote: >> >> I'm actually already running 1.1.1. &

Yarn not running as many executors as I'd like

2014-12-19 Thread Jon Chase
Running on Amazon EMR w/Yarn and Spark 1.1.1, I have trouble getting Yarn to use the number of executors that I specify in spark-submit: --num-executors 2 in a cluster with two core nodes typically results in only one executor running at a time. I can play with the memory settings and num-co
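
For reference, a sketch of the settings that interact here when configuring the job programmatically (values are illustrative, not a fix): YARN only grants an executor if a node still has enough free memory and vcores, so the per-executor memory and core requests have to fit within each node's YARN allocation for both executors to start.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf()
            .setAppName("executor-count-example")
            .set("spark.executor.instances", "2")   // equivalent of --num-executors 2
            .set("spark.executor.memory", "4g")     // must fit the node's YARN container limit
            .set("spark.executor.cores", "4");

    JavaSparkContext jsc = new JavaSparkContext(conf);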

Re: ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Jon Chase
I've had a lot of difficulties with using the s3:// prefix. s3n:// seems to work much better. Can't find the link ATM, but I seem to recall that s3:// (Hadoop's original block format for S3) is no longer recommended for use. Amazon's EMR goes so far as to remap the s3:// to s3n:// behind the scene
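
A minimal sketch of the suggestion, assuming an existing JavaSparkContext jsc and placeholder bucket/key names: read and write through the s3n:// scheme and supply credentials via the Hadoop configuration.

    import org.apache.spark.api.java.JavaRDD;

    // Credentials go into the Hadoop configuration used by the s3n filesystem.
    jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    // s3n:// rather than s3:// (Hadoop's old block format).
    JavaRDD<String> lines = jsc.textFile("s3n://my-bucket/input/*.gz");
    lines.saveAsTextFile("s3n://my-bucket/output/run-1");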

Re: JavaRDD (Data Aggregation) based on key

2014-12-23 Thread Jon Chase
Have a look at RDD.groupBy(...) and reduceByKey(...) On Tue, Dec 23, 2014 at 4:47 AM, sachin Singh wrote: > Hi, > I have a csv file having fields as a,b,c . > I want to do aggregation(sum,average..) based on any field(a,b or c) as per > user input, > using Apache Spark Java API,Please Help Urgen
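
A small sketch of the reduceByKey suggestion, assuming a header-less CSV with columns a,b,c, grouping on column a and summing column c; the file path, delimiter, and column positions are illustrative, and jsc is an existing JavaSparkContext:

    import scala.Tuple2;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> lines = jsc.textFile("data.csv");

    // Key each row by column "a" (index 0) and sum column "c" (index 2) per key.
    JavaPairRDD<String, Double> sums = lines
            .mapToPair(line -> {
                String[] f = line.split(",");
                return new Tuple2<>(f[0], Double.parseDouble(f[2]));
            })
            .reduceByKey((x, y) -> x + y);

    sums.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));

An average per key can be built the same way by reducing (sum, count) pairs and dividing at the end.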

Re: S3 files , Spark job hungsup

2014-12-23 Thread Jon Chase
http://www.jets3t.org/toolkit/configuration.html Put the following properties in a file named jets3t.properties and make sure it is available when your Spark job runs (just place it in ~/ and pass a reference to it when calling spark-submit with --files ~/jets3t.properties) httpclien

Re: Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-23 Thread Jon Chase
ultifarious, Inc. | http://mult.ifario.us/ > > On Thu, Dec 18, 2014 at 5:56 AM, Jon Chase wrote: > >> I'm running a very simple Spark application that downloads files from S3, >> does a bit of mapping, then uploads new files. Each file is roughly 2MB >> and