The biggest danger with gzipped files is this:

>>> raw = sc.textFile("/path/to/file.gz", 8)
>>> raw.getNumPartitions()
1

You might think you're telling Spark to parallelize reads of the input, but
Spark cannot split a gzipped file, so each gzipped file ends up in a single
partition no matter what partition count you pass in.
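
If downstream stages need more parallelism, a common workaround is to
repartition right after the read. A minimal sketch (the variable names and
the target count of 8 are just placeholders):

>>> raw = sc.textFile("/path/to/file.gz")
>>> raw.getNumPartitions()  # gzip is not splittable, so only 1 partition
1
>>> rdd = raw.repartition(8)  # shuffle the data out into 8 partitions
>>> rdd.getNumPartitions()
8

The repartition itself costs a shuffle, but everything after it can then use
all your cores.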

It might be a nice user hint if Spark warned when parallelism is disabled
by the input format.

Nick

On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:

> Hi Nicholas,
>
> Gzipping is an impressive guess! Yes, they are.
> My data sets are too large to make repartitioning viable, but I could try
> it on a subset.
> I generally have many more partitions than cores.
> This was happening before I started setting those configs.
>
> thanks
> Daniel
>
>
> On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Are you dealing with gzipped files by any chance? Does explicitly
>> repartitioning your RDD to match the number of cores in your cluster help
>> at all? How about if you don't specify the configs you listed and just go
>> with defaults all around?
>>
>> On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>
>>> I launch the cluster using vanilla spark-ec2 scripts.
>>> I just specify the number of slaves and the instance type.
>>>
>>> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com>
>>> wrote:
>>>
>>>> I usually run interactively from the spark-shell.
>>>> My data definitely has more than enough partitions to keep all the
>>>> workers busy.
>>>> When I launch the cluster, I first do:
>>>>
>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>> cat <<EOF >>~/spark/conf/spark-defaults.conf
>>>> spark.serializer        org.apache.spark.serializer.KryoSerializer
>>>> spark.rdd.compress      true
>>>> spark.shuffle.consolidateFiles  true
>>>> spark.akka.frameSize  20
>>>> EOF
>>>>
>>>> copy-dir /root/spark/conf
>>>> spark/sbin/stop-all.sh
>>>> sleep 5
>>>> spark/sbin/start-all.sh
>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>> before starting the spark-shell or running any jobs.
>>>>
>>>>
>>>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Perhaps your RDD is not partitioned enough to utilize all the cores in
>>>>> your system.
>>>>>
>>>>> Could you post a simple code snippet and explain what kind of
>>>>> parallelism you are seeing for it? And can you report on how many
>>>>> partitions your RDDs have?
>>>>>
>>>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>>>> My understanding is that this configures Spark to use the available
>>>>>> resources.
>>>>>> I can see that Spark will use the available memory on larger instance
>>>>>> types.
>>>>>> However, I have never seen Spark running at more than 400% CPU (i.e.,
>>>>>> 100% on each of 4 cores), even on machines with many more cores.
>>>>>> Am I misunderstanding the docs? Is it just that high-end EC2
>>>>>> instances get I/O starved when running Spark? It would be strange if that
>>>>>> consistently produced a hard limit of 400%, though.
>>>>>>
>>>>>> thanks
>>>>>> Daniel
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
