Re: for loops in pyspark

2017-09-21 Thread Femi Anthony
How much memory do the "do some stuff" portions occupy? You should try caching the RDD and taking a look at the Spark UI under the Storage tab to see how much memory is being used. Also, what portion of each worker's overall memory are you allocating when you call spark-submit?
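A minimal sketch of what I mean, assuming a plain SparkContext setup (the data and the 4g figure below are placeholders, not values from the original script):

    from pyspark import SparkConf, SparkContext

    # Executor memory can also be set on the command line, e.g.:
    #   spark-submit --executor-memory 4g my_script.py
    conf = SparkConf().set("spark.executor.memory", "4g")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(1000)).flatMap(lambda x: [x, x * 2])
    rdd.cache()    # mark the RDD for in-memory storage
    rdd.count()    # force materialization so it shows up in the UI
    # Check the Spark UI (http://<driver>:4040 by default) under the
    # Storage tab to see how much memory the cached RDD occupies.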

Re: for loops in pyspark

2017-09-21 Thread Jeff Zhang
I suspect the OOM happens on the executor side; you will have to check the stack trace yourself if you cannot attach more info. Most likely it is due to your user code.

Re: for loops in pyspark

2017-09-21 Thread Alexander Czech
That is not really possible; the whole project is rather large, and I would not like to release it before I have published the results. But if there are no known issues with running Spark in a for loop, I will look into other possibilities for memory leaks. Thanks
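For the record, one thing I will check is whether cached RDDs stay alive across iterations. A sketch of the explicit cleanup I have in mind, reusing the placeholder names (chunks, somefunc, make_df) from my original post:

    for chunk in chunks:
        my_rdd = sc.parallelize(chunk).flatMap(somefunc)
        my_rdd.cache()
        # do some stuff with my_rdd
        my_df = make_df(my_rdd)
        # do some stuff with my_df
        my_rdd.unpersist()    # release cached blocks before the next iteration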

Re: for loops in pyspark

2017-09-20 Thread Weichen Xu
Spark manages memory allocation and release automatically. Can you post the complete program? That would help with checking where things go wrong.

for loops in pyspark

2017-09-20 Thread Alexander Czech
Hello all,

I'm running a pyspark script that makes use of a for loop to create smaller chunks of my main dataset. Some example code:

    for chunk in chunks:
        my_rdd = sc.parallelize(chunk).flatMap(somefunc)
        # do some stuff with my_rdd
        my_df = make_df(my_rdd)
        # do some stuff with my_df
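A self-contained toy version of the pattern that runs end to end; the chunk data, somefunc, and the toDF call here are illustrative stand-ins, not my real project code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("chunk-loop").getOrCreate()
    sc = spark.sparkContext

    chunks = [["a b", "c d"], ["e f", "g h"]]    # placeholder input chunks

    def somefunc(line):
        return line.split(" ")    # toy flatMap function

    for chunk in chunks:
        my_rdd = sc.parallelize(chunk).flatMap(somefunc)
        my_df = my_rdd.map(lambda w: (w,)).toDF(["word"])    # stand-in for make_df
        my_df.show()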