I launch the cluster using vanilla spark-ec2 scripts.
I just specify the number of slaves and the instance type.

On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com> wrote:

> I usually run interactively from the spark-shell.
> My data definitely has more than enough partitions to keep all the workers
> busy.
> When I first launch the cluster I do:
>
> +++++++++++++++++++++++++++++++++++++++++++++++++
> cat <<EOF >>~/spark/conf/spark-defaults.conf
> spark.serializer        org.apache.spark.serializer.KryoSerializer
> spark.rdd.compress      true
> spark.shuffle.consolidateFiles  true
> spark.akka.frameSize  20
> EOF
>
> copy-dir /root/spark/conf
> spark/sbin/stop-all.sh
> sleep 5
> spark/sbin/start-all.sh
> +++++++++++++++++++++++++++++++++++++++++++++++++
>
> before starting the spark-shell or running any jobs.
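>
> Once the shell is up, a quick way to double-check that the settings were
> actually picked up is to query the conf from the Scala prompt, roughly like
> this (sc is the SparkContext the shell creates):
>
> +++++++++++++++++++++++++++++++++++++++++++++++++
> // should come back as the values set in spark-defaults.conf,
> // e.g. org.apache.spark.serializer.KryoSerializer and "true"
> sc.getConf.get("spark.serializer")
> sc.getConf.get("spark.rdd.compress")
> +++++++++++++++++++++++++++++++++++++++++++++++++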
>
>
>
>
> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Perhaps your RDD is not partitioned enough to utilize all the cores in
>> your system.
>>
>> Could you post a simple code snippet and explain what kind of parallelism
>> you are seeing for it? And can you report on how many partitions your RDDs
>> have?
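>>
>> For example, something along these lines at the spark-shell prompt would
>> show it (just a sketch -- "myRdd" stands for whichever RDD you are loading,
>> and the s3n path is made up):
>>
>> +++++++++++++++++++++++++++++++++++++++++++++++++
>> val myRdd = sc.textFile("s3n://some-bucket/some-input")
>>
>> // how many partitions the RDD has, i.e. how many tasks each stage can run
>> myRdd.partitions.size
>>
>> // what Spark treats as the default parallelism for this cluster
>> sc.defaultParallelism
>>
>> // if the partition count is small, spreading the data over more partitions
>> // lets more cores work at once
>> val morePartitions = myRdd.repartition(sc.defaultParallelism * 3)
>> morePartitions.partitions.size
>> +++++++++++++++++++++++++++++++++++++++++++++++++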
>>
>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>
>>>
>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>> My understanding is that this configures Spark to use the available
>>> resources.
>>> I can see that Spark will use the available memory on larger instance
>>> types.
>>> However, I have never seen Spark running at more than 400% CPU (i.e. 100%
>>> on 4 cores) on machines with many more cores.
>>> Am I misunderstanding the docs? Is it just that high-end EC2 instances
>>> get I/O starved when running Spark? It would be strange if that
>>> consistently produced a hard 400% limit, though.
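>>>
>>> In case it helps to narrow it down, a rough way to see what the cluster
>>> actually registered is to check from the spark-shell (just a sketch):
>>>
>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>> // in standalone mode this typically defaults to the total cores
>>> // across the workers, unless spark.default.parallelism is set
>>> sc.defaultParallelism
>>>
>>> // block managers known to the driver, roughly executors + the driver itself
>>> sc.getExecutorMemoryStatus.size
>>> +++++++++++++++++++++++++++++++++++++++++++++++++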
>>>
>>> thanks
>>> Daniel
>>>
>>
>>
>
