Hi Nicholas,

Gzipping is an impressive guess! Yes, they are gzipped.
My data sets are too large to make repartitioning viable, but I could try
it on a subset.
I generally have many more partitions than cores.
This was happening before I started setting those configs.
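
For the subset test I have something roughly like the sketch below in mind
(the S3 path and the partition multiplier are just placeholders, not my real
setup): since the files are gzipped they are not splittable, so I would check
the initial partition count and then repartition explicitly before running
anything heavy.

+++++++++++++++++++++++++++++++++++++++++++++++++
// rough sketch for the subset test -- path and numbers are placeholders
val raw = sc.textFile("s3n://my-bucket/subset/*.gz")  // gzip files are not splittable,
println(raw.partitions.size)                          // so expect roughly one partition per file

// spread the data across all cores before doing any real work
val numParts = sc.defaultParallelism * 3              // arbitrary multiplier, just for the test
val repart = raw.repartition(numParts)
println(repart.partitions.size)

repart.map(_.length).count()                          // cheap action, just to watch CPU utilization
+++++++++++++++++++++++++++++++++++++++++++++++++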

thanks
Daniel


On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Are you dealing with gzipped files by any chance? Does explicitly
> repartitioning your RDD to match the number of cores in your cluster help
> at all? How about if you don't specify the configs you listed and just go
> with defaults all around?
>
> On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>
>> I launch the cluster using the vanilla spark-ec2 scripts.
>> I just specify the number of slaves and the instance type.
>>
>> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>
>>> I usually run interactively from the spark-shell.
>>> My data definitely has more than enough partitions to keep all the
>>> workers busy.
>>> When I first launch the cluster, I do:
>>>
>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>> cat <<EOF >>~/spark/conf/spark-defaults.conf
>>> spark.serializer        org.apache.spark.serializer.KryoSerializer
>>> spark.rdd.compress      true
>>> spark.shuffle.consolidateFiles  true
>>> spark.akka.frameSize  20
>>> EOF
>>>
>>> copy-dir /root/spark/conf
>>> spark/sbin/stop-all.sh
>>> sleep 5
>>> spark/sbin/start-all.sh
>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>> before starting the spark-shell or running any jobs.
>>>
>>>
>>>
>>>
>>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Perhaps your RDD is not partitioned enough to utilize all the cores in
>>>> your system.
>>>>
>>>> Could you post a simple code snippet and explain what kind of
>>>> parallelism you are seeing for it? And can you report on how many
>>>> partitions your RDDs have?
>>>>
>>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>>> My understanding is that this configures Spark to use the available
>>>>> resources.
>>>>> I can see that Spark will use the available memory on larger instance
>>>>> types.
>>>>> However, I have never seen Spark running at more than 400% CPU (i.e.,
>>>>> 100% on 4 cores) on machines with many more cores.
>>>>> Am I misunderstanding the docs? Is it just that high-end EC2 instances
>>>>> get I/O starved when running Spark? It would be strange if that
>>>>> consistently produced a 400% hard limit, though.
>>>>>
>>>>> thanks
>>>>> Daniel
>>>>>
>>>>
>>>>
>>>
>>
>
