I usually run interactively from the spark-shell.
My data definitely has more than enough partitions to keep all the workers
busy (a quick way to check this from the shell is sketched further down).
When I first launch the cluster I do the following:

+++++++++++++++++++++++++++++++++++++++++++++++++
# append extra settings to the master's spark-defaults.conf
cat <<EOF >>~/spark/conf/spark-defaults.conf
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.rdd.compress      true
spark.shuffle.consolidateFiles  true
spark.akka.frameSize  20
EOF

# push the updated conf to the slaves (spark-ec2 helper script) and
# restart the cluster so the workers pick up the new settings
copy-dir /root/spark/conf
spark/sbin/stop-all.sh
sleep 5
spark/sbin/start-all.sh
+++++++++++++++++++++++++++++++++++++++++++++++++

before starting the spark-shell or running any jobs.
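
Once the shell comes up, a quick sanity check along these lines (just a
sketch; the RDD and s3n path below are placeholders, not my actual job)
confirms that the settings were picked up and that an RDD has enough
partitions to occupy all the cores:

+++++++++++++++++++++++++++++++++++++++++++++++++
// verify the restarted cluster picked up spark-defaults.conf
sc.getConf.get("spark.serializer", "<not set>")  // expect ...KryoSerializer
sc.defaultParallelism                            // total cores the scheduler targets

// placeholder input; substitute the real data
val myRdd = sc.textFile("s3n://some-bucket/some-input")
myRdd.partitions.size                            // how many tasks can run in parallel

// if there are fewer partitions than cores, spread the data out more
val wider = myRdd.repartition(sc.defaultParallelism * 2)
+++++++++++++++++++++++++++++++++++++++++++++++++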




On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Perhaps your RDD is not partitioned enough to utilize all the cores in
> your system.
>
> Could you post a simple code snippet and explain what kind of parallelism
> you are seeing for it? And can you report on how many partitions your RDDs
> have?
>
> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>
>>
>> I am launching EC2 clusters using the spark-ec2 scripts.
>> My understanding is that this configures spark to use the available
>> resources.
>> I can see that spark will use the available memory on larger instance
>> types.
>> However, I have never seen spark running at more than 400% (using 100% on
>> 4 cores) on machines with many more cores.
>> Am I misunderstanding the docs? Is it just that high-end EC2 instances
>> get I/O starved when running spark? It would be strange if that
>> consistently produced a 400% hard limit, though.
>>
>> thanks
>> Daniel
>>
>
>
