I found this parameter referenced in python/pyspark/rdd.py.

partitionBy() has some explanation of how it tries to reduce the amount of
data transferred to Java.
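
For anyone comparing notes, here is a minimal word-count sketch of the kind of
job in question, with both knobs set explicitly. The app name, memory value,
partition count, and paths below are placeholders I made up, not taken from
Connor's setup:

from operator import add
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("wordcount-worker-memory-test")           # placeholder name
        .set("spark.python.worker.memory", "512m"))           # example value only
sc = SparkContext(conf=conf)

num_partitions = 96  # e.g. 4 nodes x 24 cores; vary per experiment

counts = (sc.textFile("hdfs:///path/to/input")                # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add, numPartitions=num_partitions))  # shuffles via partitionBy()

counts.saveAsTextFile("hdfs:///path/to/output")               # placeholder path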

FYI

On Tue, Oct 27, 2015 at 6:29 PM, Connor Zanin <cnnr...@udel.edu> wrote:

> Hi all,
>
> I am running a simple word count job on a cluster of 4 nodes (24 cores per
> node). I am varying two parameters in the configuration,
> spark.python.worker.memory and the number of partitions in the RDD. My job
> is written in python.
>
> I am observing a discontinuity in the run time of the job when the
> spark.python.worker.memory is increased past a threshold. Unfortunately, I
> am having trouble understanding exactly what this parameter is doing to
> Spark internally and how it changes Spark's behavior to create this
> discontinuity.
>
> The documentation describes this parameter as "Amount of memory to use
> per python worker process during aggregation," but I find this vague (or
> I do not know enough Spark terminology to know what it means).
>
> I have been pointed to the source code in the past, specifically the
> shuffle.py file where _spill() appears.
>
> Can anyone explain how this parameter behaves or point me to more
> descriptive documentation? Thanks!
>
> --
> Regards,
>
> Connor Zanin
> Computer Science
> University of Delaware
>
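
As a postscript for anyone who finds this thread later: my understanding, from
the documentation quoted above and from shuffle.py, is that
spark.python.worker.memory is a soft limit on how much memory the Python worker
may use while aggregating; once the in-memory aggregates grow past that limit,
the merger spills partial results to disk (that is what _spill() does) and
keeps going. The class below is only a toy sketch of that spill-when-over-limit
pattern, not Spark's actual implementation; it uses a crude item-count limit
where Spark measures real process memory, and ToyMerger is a name I invented
for illustration.

import os
import pickle
import tempfile

class ToyMerger(object):
    """Toy stand-in for a bounded-memory aggregator with disk spilling."""

    def __init__(self, item_limit, combine):
        self.item_limit = item_limit  # crude stand-in for spark.python.worker.memory
        self.combine = combine        # e.g. lambda a, b: a + b for word count
        self.data = {}
        self.spills = []

    def merge(self, pairs):
        # Aggregate in memory until the limit is exceeded, then spill.
        for key, value in pairs:
            self.data[key] = self.combine(self.data[key], value) if key in self.data else value
            if len(self.data) > self.item_limit:
                self._spill()

    def _spill(self):
        # Persist the current partial aggregates and start a fresh dict,
        # trading extra disk I/O for a bounded memory footprint.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.data, f)
        self.spills.append(path)
        self.data = {}

    def items(self):
        # Final result = in-memory aggregates merged with every spill file.
        result = dict(self.data)
        for path in self.spills:
            with open(path, "rb") as f:
                for key, value in pickle.load(f).items():
                    result[key] = self.combine(result[key], value) if key in result else value
        return result.items()

Usage would be something like ToyMerger(item_limit=100000, combine=lambda a, b:
a + b), which is of course nothing like Spark's real tuning. The point is only
that below the limit everything stays in memory, while above it the job starts
paying for disk I/O on every spill, and that kind of regime change could show
up as a discontinuity in run time.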
