How much memory do the "do some stuff" portions occupy? You should try
caching the RDD and take a look at the Spark UI under the Storage tab to see
how much memory is being used. Also, what portion of each worker's overall
memory are you allocating when you call spark-submit?
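
For example, a minimal sketch reusing the names from your snippet below
(my_rdd, make_df, somefunc); the StorageLevel and the append write mode are my
own additions, adjust as needed:

from pyspark import StorageLevel

for chunk in chunks:
    my_rdd = sc.parallelize(chunk).flatMap(somefunc)
    # mark the RDD for caching; its size shows up in the Storage tab once it is computed
    my_rdd.persist(StorageLevel.MEMORY_AND_DISK)
    my_df = make_df(my_rdd)
    # append so repeated writes to the same path do not fail
    my_df.write.mode('append').parquet('./some/path')
    # drop the cached blocks before the next chunk
    my_rdd.unpersist()

Per-executor memory is whatever you pass at submit time, e.g.
spark-submit --executor-memory 4g --num-executors 8 your_script.py
(the values and script name are just placeholders), so it is worth comparing
that against the cached sizes you see in the Storage tab.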

Sent from my iPhone

> On Sep 21, 2017, at 6:58 AM, Jeff Zhang <zjf...@gmail.com> wrote:
> 
> 
> I suspect the OOM happens on the executor side. You will have to check the 
> stacktrace yourself if you cannot attach more info. Most likely it is due to 
> your user code.
> 
> 
> Alexander Czech <alexander.cz...@googlemail.com> wrote on Thu, Sep 21, 2017 at 5:54 PM:
>> That is not really possible; the whole project is rather large and I would 
>> not like to release it before I have published the results.
>> 
>> But if there are no known issues with running Spark in a for loop, I will look 
>> into other possibilities for memory leaks.
>> 
>> Thanks
>> 
>> 
>> On 20 Sep 2017 15:22, "Weichen Xu" <weichen...@databricks.com> wrote:
>> Spark manages memory allocation and release automatically. Can you post the 
>> complete program to help check where things go wrong?
>> 
>> On Wed, Sep 20, 2017 at 8:12 PM, Alexander Czech 
>> <alexander.cz...@googlemail.com> wrote:
>>> Hello all,
>>> 
>>> I'm running a pyspark script that uses a for loop to create smaller 
>>> chunks of my main dataset.
>>> 
>>> some example code:
>>> 
>>> for chunk in chunks:
>>>     my_rdd = sc.parallelize(chunk).flatMap(somefunc)
>>>     # do some stuff with my_rdd
>>>     
>>>     my_df = make_df(my_rdd)
>>>     # do some stuff with my_df
>>>     my_df.write.parquet('./some/path')
>>> 
>>> After a couple of loops I always start to lose executors because of 
>>> out-of-memory errors. Is there a way to free up memory after each loop? Do I 
>>> have to do it in Python or with Spark?
>>> 
>>> Thanks
>> 
>> 
