; tasks to the slaves.
>
> Thanks
>
> Andy
>
> From: Daniel Mahler
> Date: Monday, October 20, 2014 at 5:22 PM
> To: Nicholas Chammas
> Cc: user
> Subject: Re: Getting spark to use more than 4 cores on Amazon EC2
>
> I am using globs though
>
> raw = sc.textFile("/path/to/dir/*/*")
I am using globs though
raw = sc.textFile("/path/to/dir/*/*")
and I have tons of files so 1 file per partition should not be a problem.
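(Not from the thread, just a minimal sketch of how to confirm that from the PySpark shell; the glob path is the hypothetical one above and sc is the shell's SparkContext.)

raw = sc.textFile("/path/to/dir/*/*")
# With many plain-text files, expect roughly one partition per file
# (more if a single file spans several input splits).
print(raw.getNumPartitions())

If that number is well above the cluster's core count, the partitioning itself is probably not what limits parallelism.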
On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
The biggest danger with gzipped files is this:
>>> raw = sc.textFile("/path/to/file.gz", 8)
>>> raw.getNumPartitions()
1
You think you’re telling Spark to parallelize the reads on the input, but
Spark cannot parallelize reads against gzipped files. So 1 gzipped file
gets assigned to 1 partition.
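A minimal sketch of the usual workaround, assuming the same hypothetical paths (the partition count here is made up): the read of a .gz file is still a single task, but repartitioning right after it lets every later stage use all the cores.

raw = sc.textFile("/path/to/file.gz")   # gzip is not splittable, so: 1 partition per .gz file
print(raw.getNumPartitions())           # -> 1 for a single gzipped input

spread = raw.repartition(32)            # shuffle into 32 partitions; ~2-4x total cores is typical
print(spread.getNumPartitions())        # -> 32, so downstream transformations can run in parallel

The repartition costs a shuffle, so it only pays off when the work after the read is substantial.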
Hi Nicholas,
Gzipping is an impressive guess! Yes, they are.
My data sets are too large to make repartitioning viable, but I could try
it on a subset.
I generally have many more partitions than cores.
This was happening before I started setting those configs.
thanks
Daniel
On Mon, Oct 20, 2014, Nicholas Chammas wrote:
Are you dealing with gzipped files by any chance? Does explicitly
repartitioning your RDD to match the number of cores in your cluster help
at all? How about if you don't specify the configs you listed and just go
with defaults all around?
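As a rough illustration of the repartitioning experiment suggested above (the cluster size and input path are assumptions, not numbers from this thread):

total_cores = 2 * 8                           # e.g. 2 slaves with 8 cores each; adjust to your cluster
raw = sc.textFile("/path/to/dir/*/*")         # hypothetical input
balanced = raw.repartition(total_cores * 2)   # a small multiple of the core count is a common rule of thumb
print(balanced.getNumPartitions())

If CPU utilization improves with this in place, the original partitioning, not the cluster configuration, was the limit.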
On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler wrote:
I launch the cluster using vanilla spark-ec2 scripts.
I just specify the number of slaves and instance type
On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler wrote:
I usually run interactively from the spark-shell.
My data definitely has more than enough partitions to keep all the workers
busy.
When I first launch the cluster, I do:

cat <<EOF >> ~/spark/conf/spark-defaults.conf
spark.serializer org.apache.spark.serializer.KryoSerializer
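(Not from the thread, but a quick sanity check from the PySpark shell that settings appended to spark-defaults.conf were actually picked up; the fallback string is just a placeholder.)

conf = sc.getConf()
print(conf.get("spark.serializer", "unset -- Java serialization default"))

If it comes back unset, the shell was probably started before the file was edited, or with a different SPARK_CONF_DIR.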
Perhaps your RDD is not partitioned enough to utilize all the cores in your
system.
Could you post a simple code snippet and explain what kind of parallelism
you are seeing for it? And can you report on how many partitions your RDDs
have?
On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler wrote:
How are you launching the cluster, and how are you submitting the job to
it? Can you list any Spark configuration parameters you provide?
On Mon, Oct 20, 2014 at 12:53 PM, Daniel Mahler wrote:
I am launching EC2 clusters using the spark-ec2 scripts.
My understanding is that this configures spark to use the available
resources.
I can see that spark will use the available memory on larger instance types.
However I have never seen spark running at more than 400% (using 100% on 4
cores)
on machines with more cores.
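A hedged diagnostic sketch for the stuck-at-400% symptom (paths are hypothetical; run from the PySpark shell on the cluster): compare how much parallelism Spark thinks it has with how many partitions the job actually gets.

print(sc.defaultParallelism)                          # on a standalone cluster, roughly the total executor cores
print(sc.getConf().get("spark.cores.max", "unset"))   # an explicit cap on cores, if one was ever set

raw = sc.textFile("/path/to/dir/*/*")
print(raw.getNumPartitions())   # if this is <= 4, at most 4 tasks (and so 4 cores) can run at once

If defaultParallelism itself reports only 4, the workers registered fewer cores than the instance type provides; if it reports more, the bottleneck is more likely the partitioning or a non-splittable input format.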