Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Yes, sorry, I wasn't clear -- I still have to trigger the calculation of the RDD at the end of each iteration; otherwise all of the lookup tables are shipped to the cluster at the same time, resulting in memory errors. Therefore this becomes several map jobs instead of one, and each consecutive map …
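The pattern Rok describes can be sketched without Spark. This is a plain-Python analogy (all function and variable names here are illustrative, not from the thread): split the large lookup table into chunks and apply one map pass per chunk, materializing the result after each pass instead of shipping the whole table at once.

```python
# Plain-Python sketch of the chunked-lookup pattern (no Spark required).
# In PySpark each pass would broadcast one chunk, map over the RDD, and
# force evaluation (e.g. with count()) before moving to the next chunk.

def chunk_dict(d, n_chunks):
    """Split a dict into n_chunks smaller dicts."""
    items = sorted(d.items())
    return [dict(items[i::n_chunks]) for i in range(n_chunks)]

def apply_lookup_pass(records, lookup_chunk):
    """One map pass: translate keys found in this chunk, keep others as-is."""
    return [lookup_chunk.get(r, r) for r in records]

lookup_table = {i: i * 10 for i in range(6)}   # stand-in for the big table
records = [0, 1, 2, 3, 4, 5]

for chunk in chunk_dict(lookup_table, 3):
    # Materializing here is the analogue of triggering the RDD each
    # iteration, so only one chunk is "live" at a time.
    records = apply_lookup_pass(records, chunk)

print(records)  # [0, 10, 20, 30, 40, 50]
```

The trade-off, as the thread notes, is that one logical map becomes several map jobs, each with its own scheduling and serialization overhead.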

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar wrote:
> the runtime for each consecutive iteration is still roughly twice as long as
> for the previous one -- is there a way to reduce whatever overhead is
> accumulating?

Sorry, I didn't fully understand your question; which two are you comparing?

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
The runtime for each consecutive iteration is still roughly twice as long as for the previous one -- is there a way to reduce whatever overhead is accumulating?

On Feb 11, 2015, at 8:11 PM, Davies Liu wrote:
> On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
>> I was having trouble with memory exceptions …
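One plausible source of the accumulating overhead Rok reports (an assumption on my part, not something confirmed in the thread) is that each iteration's map is chained lazily onto the previous one, so every action re-evaluates the whole chain. A plain-Python analogy, with illustrative names:

```python
# Plain-Python analogy (not Spark) for overhead that grows per iteration:
# if every pass is chained lazily onto the previous one and nothing is
# materialized, evaluating after pass k redoes the work of all k passes.

work_done = 0

def make_pass(prev_eval):
    def evaluate(data):
        global work_done
        out = []
        for x in prev_eval(data):
            work_done += 1          # one unit of work per element per pass
            out.append(x + 1)
        return out
    return evaluate

evaluate = lambda data: data        # base dataset: no passes yet
costs = []
for _ in range(4):
    evaluate = make_pass(evaluate)  # chain one more lazy pass
    before = work_done
    evaluate([0] * 100)             # "action": recomputes the whole chain
    costs.append(work_done - before)

print(costs)  # [100, 200, 300, 400] -- each evaluation costs more
```

In Spark terms, caching or checkpointing the RDD after each iteration would cut the chain, so each pass only pays for its own work; whether that explains the exact doubling Rok saw is not settled in the thread.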

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Aha, great! Thanks for the clarification!

On Feb 11, 2015 8:11 PM, "Davies Liu" wrote:
> On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
>> I was having trouble with memory exceptions when broadcasting a large
>> lookup table, so I've resorted to processing it iteratively -- but how can
>> I modify …

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
> I was having trouble with memory exceptions when broadcasting a large lookup
> table, so I've resorted to processing it iteratively -- but how can I modify
> an RDD iteratively?
>
> I'm trying something like:
>
> rdd = sc.parallelize(...)
> lookup_table = …
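A pitfall worth noting when building map passes in a loop like the one quoted above (this sketch uses plain Python functions rather than RDDs, and the names are illustrative): a lambda closes over the loop variable itself, so every pass ends up seeing the last lookup chunk unless the chunk is bound as a default argument at definition time.

```python
# Late-binding closure pitfall in loops that build per-chunk map functions.

chunks = [{0: "a"}, {1: "b"}]

# All lambdas here share the same variable c, which ends as the last chunk.
broken = [lambda x: c.get(x, x) for c in chunks]

# Binding c as a default argument freezes the intended chunk per lambda.
fixed = [lambda x, c=c: c.get(x, x) for c in chunks]

print(broken[0](0))  # 0   -- first pass wrongly sees the last chunk
print(fixed[0](0))   # 'a' -- first pass sees its own chunk
```

Whether this affected rok's actual loop is unknown; it is simply a common failure mode for this shape of code.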

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
We have moved to Sphinx to generate the Python API docs, so the link is different from the 1.0/1.1 ones:
http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.mapPartitions

On Wed, Feb 11, 2015 at 10:55 AM, Charles Feduke wrote:
> If you use mapPartitions to iterate the lookup_tables does that improve
> the performance?

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Yes, I actually do use mapPartitions already.

On Feb 11, 2015 7:55 PM, "Charles Feduke" wrote:
> If you use mapPartitions to iterate the lookup_tables does that improve
> the performance?
>
> This link is to Spark docs 1.1 because both latest and 1.2 for Python give
> me a 404:
> http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions

Re: iteratively modifying an RDD

2015-02-11 Thread Charles Feduke
If you use mapPartitions to iterate over the lookup_tables, does that improve the performance?

This link is to the Spark 1.1 docs because both latest and 1.2 for Python give me a 404:
http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions

On Wed Feb 11 2015 at 1:48:42 PM rok wrote: …
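The reason mapPartitions can help here: it hands your function an iterator over each whole partition and expects an iterator back, so per-partition setup (such as preparing a lookup) runs once per partition rather than once per element. A plain-Python simulation of that contract (the partitioning and names below are illustrative, not Spark itself):

```python
# Simulation of the mapPartitions contract: the user function receives an
# iterator over one partition's elements and yields transformed elements.

lookup_table = {1: "one", 2: "two", 3: "three", 4: "four"}

def translate_partition(iterator):
    # Per-partition setup (e.g. loading a lookup chunk) would go here,
    # paying its cost once per partition instead of once per element.
    for x in iterator:
        yield lookup_table.get(x, x)

partitions = [[1, 2], [3, 4]]       # stand-in for an RDD's partitions

# Rough equivalent of rdd.mapPartitions(translate_partition).collect():
result = [y for part in partitions for y in translate_partition(iter(part))]
print(result)  # ['one', 'two', 'three', 'four']
```

In PySpark the same function would be passed directly to `rdd.mapPartitions(translate_partition)`.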