Hi All,
It appears that the bottleneck in my job was the EBS volumes: I was seeing
very high I/O wait times across the cluster. I was only using 1 volume;
increasing to 4 made it faster.
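For reference, one reason extra volumes help: Spark spreads shuffle and spill
files across its local directories, so more disks means less contention on any
one device. A minimal sketch of pointing Spark at several mounts (the mount
points are hypothetical; on YARN/EMR, yarn.nodemanager.local-dirs takes
precedence over spark.local.dir):

    import org.apache.spark.{SparkConf, SparkContext}

    // Spread shuffle/spill files across several volumes so disk I/O is
    // not serialized onto one device. Mount points are hypothetical; on
    // YARN (e.g. EMR), yarn.nodemanager.local-dirs overrides this setting.
    val conf = new SparkConf()
      .setAppName("etl")
      .set("spark.local.dir", "/mnt/vol1,/mnt/vol2,/mnt/vol3,/mnt/vol4")
    val sc = new SparkContext(conf)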
Thanks,
Pradeep
On Thu, Apr 20, 2017 at 3:12 PM, Pradeep Gollakota wrote:
> Hi All,
>
> I have a simple ETL
Hi All,
I have a simple ETL job that reads some data, shuffles it and writes it
back out. This is running on AWS EMR 5.4.0 using Spark 2.1.0.
After Stage 0 completes and the job starts Stage 1, I see a huge slowdown
in the job. The CPU usage is low on the cluster, as is the network I/O.
From the
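For context, the shape of the job, as a minimal sketch (paths, format, and
partition count are all hypothetical; the repartition is the shuffle whose
write side is Stage 0 and whose read side starts Stage 1):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch of the ETL: read, shuffle, write. Stage 0 is the
    // shuffle-write side of the repartition; Stage 1 begins with the
    // shuffle read, which is where the slowdown shows up.
    val spark = SparkSession.builder().appName("simple-etl").getOrCreate()
    spark.read.parquet("s3://my-bucket/input")    // hypothetical path
      .repartition(200)                           // the shuffle boundary
      .write.parquet("s3://my-bucket/output")     // hypothetical path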
Usually this kind of thing can be done at a lower level in the InputFormat,
typically by specifying the max split size. Have you looked into that
possibility with your InputFormat?
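For example, a minimal sketch of capping the split size through the new Hadoop
API (the property name is the standard FileInputFormat one; the 64 MB value
and the input path are made up for illustration):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    // Cap the maximum input split size so each file is carved into more,
    // smaller splits (and thus more tasks). 64 MB is an arbitrary value.
    val sc = new SparkContext(new SparkConf().setAppName("split-size-demo"))
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (64 * 1024 * 1024).toString)
    val lines = sc
      .newAPIHadoopFile("hdfs:///data/input",  // hypothetical path
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      .map(_._2.toString)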
On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu wrote:
> Hi Jasbir,
>
> Yes, you are right. Do you have any idea about my ques
I'm running a job that has one stage with about 60k tasks. The stage was going
pretty well until around 35k tasks had finished, at which point many of the
executors stopped running tasks. It got to the point where only 4 executors
were working on data, and all 4 were running on the same host. With about
25k t
Worked for me if I go to https://spark.apache.org/site/ but not
https://spark.apache.org
On Wed, Jul 13, 2016 at 11:48 AM, Maurin Lenglart wrote:
> Same here
>
> From: Benjamin Kim
> Date: Wednesday, July 13, 2016 at 11:47 AM
> To: manish ranjan
> Cc: user
> Subject: Re: Spark W
Looks like what I was suggesting doesn't work. :/
On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang wrote:
> Yes, that's what I suggest. TextInputFormat supports multiple inputs, so on
> the Spark side we just need to provide an API for that.
>
> On Thu, Nov 12, 2015 at 8:45
IIRC, TextInputFormat supports an input path that is a comma-separated
list. I haven't tried this, but I think you should just be able to do
sc.textFile("file1,file2,...")
On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang wrote:
> I know these workarounds, but wouldn't it be more convenient and
> strai