Yes, sorry I wasn't clear -- I still have to trigger the calculation of the RDD
at the end of each iteration. Otherwise all of the lookup tables are shipped to
the cluster at the same time, resulting in memory errors. Therefore this becomes
several map jobs instead of one, and each consecutive map takes longer than the
last.
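Schematically, the loop looks something like this (just a sketch with toy
stand-ins -- lookup_tables, process(), and the input data are made up, and sc
is the usual SparkContext):

lookup_tables = [{"a": 1}, {"b": 2}]                 # toy chunks of the big table
process = lambda row, table: table.get(row, row)     # placeholder per-row update

rdd = sc.parallelize(["a", "b", "c"])
for table in lookup_tables:
    bc = sc.broadcast(table)          # ship only this chunk to the executors
    # bind bc via a default argument so each map keeps its own broadcast
    rdd = rdd.map(lambda row, b=bc: process(row, b.value))
    rdd.cache()
    rdd.count()                       # force the job now, so only one chunk is in flight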
On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar wrote:
> the runtime for each consecutive iteration is still roughly twice as long as
> for the previous one -- is there a way to reduce whatever overhead is
> accumulating?
Sorry, I didn't fully understand your question -- which two are you comparing?
the runtime for each consecutive iteration is still roughly twice as long as
for the previous one -- is there a way to reduce whatever overhead is
accumulating?
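(One thing I was going to try, in case the accumulating overhead is the growing
lineage: checkpoint the RDD each iteration to truncate it. A sketch, with a
placeholder checkpoint path and the same made-up names as above:)

sc.setCheckpointDir("/tmp/spark-checkpoints")    # placeholder path
for table in lookup_tables:
    bc = sc.broadcast(table)
    rdd = rdd.map(lambda row, b=bc: process(row, b.value))
    rdd.cache()
    rdd.checkpoint()     # cut the lineage so it doesn't grow with each pass
    rdd.count()          # materializes both the cache and the checkpoint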
On Feb 11, 2015, at 8:11 PM, Davies Liu wrote:
> On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
>> I was having trouble with memory exceptions when broadcasting a large
>> lookup table, so I've resorted to processing it iteratively -- but how
>> can I modify an RDD iteratively?
Aha great! Thanks for the clarification!
On Feb 11, 2015 8:11 PM, "Davies Liu" wrote:
> On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
> > I was having trouble with memory exceptions when broadcasting a large
> > lookup table, so I've resorted to processing it iteratively -- but how
> > can I modify an RDD iteratively?
On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
> I was having trouble with memory exceptions when broadcasting a large lookup
> table, so I've resorted to processing it iteratively -- but how can I modify
> an RDD iteratively?
>
> I'm trying something like:
>
> rdd = sc.parallelize(...)
> lookup_tables = ...
We have moved to using Sphinx to generate the Python API docs, so the link is
different from the 1.0/1.1 ones:
http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.mapPartitions
On Wed, Feb 11, 2015 at 10:55 AM, Charles Feduke wrote:
> If you use mapPartitions to iterate the lookup_tables does that improve
> the performance?
Yes, I actually do use mapPartitions already.
On Feb 11, 2015 7:55 PM, "Charles Feduke" wrote:
> If you use mapPartitions to iterate the lookup_tables does that improve
> the performance?
>
> This link is to Spark docs 1.1 because both latest and 1.2 for Python give
> me a 404:
> http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions
If you use mapPartitions to iterate the lookup_tables does that improve the
performance?
This link is to Spark docs 1.1 because both latest and 1.2 for Python give
me a 404:
http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions
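Roughly what I have in mind, as a sketch (toy data, and assuming the tables fit
in a single broadcast):

tables_bc = sc.broadcast([{"a": "x"}, {"x": "y"}])   # toy lookup tables

def apply_tables(partition):
    tables = tables_bc.value          # fetched once per task, not once per row
    for row in partition:
        for table in tables:          # iterate the lookup tables per row
            row = table.get(row, row)
        yield row

result = sc.parallelize(["a", "b", "c"]).mapPartitions(apply_tables)
print(result.collect())               # ['y', 'b', 'c']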
On Wed Feb 11 2015 at 1:48:42 PM rok wrote: