Re: Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All, It appears that the bottleneck in my job was the EBS volumes: very high i/o wait times across the cluster. I was only using 1 volume; increasing to 4 made it faster. Thanks, Pradeep On Thu, Apr 20, 2017 at 3:12 PM, Pradeep Gollakota wrote: > Hi All, > > I have a simple ETL
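
A note for archive readers: more volumes only help the shuffle if Spark's scratch space actually spans them. A minimal sketch of pointing spark.local.dir at several mounts, assuming hypothetical mount points /mnt1../mnt4; on YARN (the EMR default), the cluster manager's yarn.nodemanager.local-dirs overrides this setting, which is why adding volumes through EMR's own instance configuration is the right fix there:

    import org.apache.spark.{SparkConf, SparkContext}

    // Spread shuffle/spill scratch space across all four volumes.
    // The mount points below are hypothetical -- match your instance layout.
    // Note: on YARN the cluster manager's yarn.nodemanager.local-dirs takes
    // precedence over spark.local.dir.
    val conf = new SparkConf()
      .setAppName("etl-job")
      .set("spark.local.dir", "/mnt1/spark,/mnt2/spark,/mnt3/spark,/mnt4/spark")
    val sc = new SparkContext(conf)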

Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All, I have a simple ETL job that reads some data, shuffles it and writes it back out. This is running on AWS EMR 5.4.0 using Spark 2.1.0. After Stage 0 completes and the job starts Stage 1, I see a huge slowdown in the job. The CPU usage is low on the cluster, as is the network I/O. From the
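
For context, a minimal sketch of the kind of read-shuffle-write job being described (Spark 2.1 Scala API; the paths, file format, and partition count are placeholders, not from the thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("simple-etl").getOrCreate()

    // Read, shuffle, write. The stage boundary sits at repartition():
    // Stage 0 writes shuffle files, Stage 1's shuffle-read pulls them back
    // off disk, which is where "Shuffle Read Blocked Time" accrues.
    val df = spark.read.parquet("s3://my-bucket/input/")
    df.repartition(200)
      .write
      .parquet("s3://my-bucket/output/")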

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Pradeep Gollakota
Usually this kind of thing can be done at a lower level, in the InputFormat, by specifying the max split size. Have you looked into that possibility with your InputFormat? On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu wrote: > Hi Jasbir, > > Yes, you are right. Do you have any idea about my ques
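
For reference, a sketch of capping split size through the Hadoop configuration so each large split is carved into smaller ones (the 64 MB cap and the input path are placeholders):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Cap each input split at ~64 MB; splits larger than this get divided.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (64 * 1024 * 1024).toString)

    val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs:///data/input")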

Executors under utilized

2016-10-06 Thread Pradeep Gollakota
I'm running a job that has one stage with about 60k tasks. The stage was going pretty well until around 35k tasks finished, at which point many of the executors stopped running any tasks. It came to the point where only 4 executors were working on data, and all 4 were running on the same host. With about 25k t
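
The thread doesn't confirm a root cause, but if the long tail is locality-driven (tasks waiting for a "preferred" executor while the rest of the cluster sits idle), two standard levers are the locality wait and speculative execution; a hedged sketch:

    import org.apache.spark.SparkConf

    // Hand tasks to any free executor rather than waiting for a node-local
    // slot, and re-launch straggler tasks speculatively on idle executors.
    val conf = new SparkConf()
      .set("spark.locality.wait", "0s")   // default is 3s per locality level
      .set("spark.speculation", "true")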

Re: Spark Website

2016-07-13 Thread Pradeep Gollakota
Worked for me if I go to https://spark.apache.org/site/ but not https://spark.apache.org On Wed, Jul 13, 2016 at 11:48 AM, Maurin Lenglart wrote: > Same here > > > > *From: *Benjamin Kim > *Date: *Wednesday, July 13, 2016 at 11:47 AM > *To: *manish ranjan > *Cc: *user > *Subject: *Re: Spark W

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
Looks like what I was suggesting doesn't work. :/ On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang wrote: > Yes, that's what I suggest. TextInputFormat support multiple inputs. So in > spark side, we just need to provide API to for that. > > On Thu, Nov 12, 2015 at 8:45
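
For anyone hitting the same wall, the workaround available in the stock API is to union per-path RDDs; a minimal sketch with placeholder paths:

    // One RDD per path, then union them -- the effect a hypothetical
    // sc.textFiles(paths) would have. Paths here are placeholders.
    val paths = Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt")
    val combined = sc.union(paths.map(p => sc.textFile(p)))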

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
IIRC, TextInputFormat supports an input path that is a comma-separated list. I haven't tried this, but I think you should just be able to do sc.textFile("file1,file2,...") On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang wrote: > I know these workaround, but wouldn't it be more convenient and > strai
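
For concreteness, the call being suggested (paths are placeholders; note that the follow-up higher in the thread reports it did not work when tried):

    // TextInputFormat treats a comma-separated path string as multiple
    // inputs, so in principle both files land in one RDD.
    val lines = sc.textFile("hdfs:///data/a.txt,hdfs:///data/b.txt")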