To: shiva...@eecs.berkeley.edu; dev@spark.apache.org
Subject: Re: Model parallelism with RDD
You can also use checkpoint to truncate the lineage; the checkpointed data is persisted to HDFS. Fundamentally, the state of the RDD needs to be saved to memory or disk if you don't want to repeat the computation.
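For reference, a minimal sketch of that checkpointing pattern, assuming a Spark 1.4-style setup; the HDFS path, the model size, and the map(_ + 1.0) update step are placeholders, not code from this thread:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
sc.setCheckpointDir("hdfs:///tmp/checkpoints")          // placeholder path on HDFS

var weights = sc.parallelize(Seq.fill(1000000)(0.0))    // RDD[Double] standing in for the model
for (i <- 1 to 10) {
  weights = weights.map(_ + 1.0)                        // placeholder update step
  if (i % 5 == 0) {
    weights.cache()       // keep the data so checkpointing does not recompute it
    weights.checkpoint()  // mark it; the next action writes it to the checkpoint dir
    weights.count()       // action that materializes the checkpoint and truncates the lineage
  }
}

Once the checkpoint is written, later actions read from the checkpoint directory instead of replaying the whole lineage.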
> oldRDD.unpersist(true)
>
> newRDD.mean
>
> avgTime += (System.nanoTime() - t) / 1e9
>
> oldRDD = newRDD
>
> i += 1
>
> }
>
> println("Avg iteration time:" + avgTime / numIterations)
>
>
>
> Best regards, Alexander
>
>
>
> From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Would using a DataFrame with spark.sql.unsafe.enabled=true remove the GC overhead when persisting/unpersisting the DataFrame?
Best regards, Alexander
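A minimal sketch of what trying that with a DataFrame might look like, assuming Spark 1.4 where spark.sql.unsafe.enabled toggles Tungsten's binary row format; the column name and the data are placeholders, not code from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

val sc = new SparkContext(new SparkConf().setAppName("unsafe-sketch"))
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.unsafe.enabled", "true")  // Tungsten unsafe mode
import sqlContext.implicits._

// one Double column standing in for the model weights
val weightsDF = sc.parallelize(0 until 1000000).map(i => Tuple1(i.toDouble)).toDF("weight")
weightsDF.cache()
weightsDF.agg(avg("weight")).show()                     // action that materializes the cached data
weightsDF.unpersist(true)

Whether this removes the GC cost of persist/unpersist is exactly the open question here: the unsafe mode changes how some SQL operators lay rows out in memory, while cached DataFrames already use a columnar byte-buffer format, so the effect on GC during caching may be limited.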
From: Ulanov, Alexander
Sent: Monday, July 13, 2015 11:15 AM
To: shiva...@eecs.berkeley.edu
Cc: dev@spark.apache.org
Subject: RE: Model parallelism with RDD
Below are the average iteration times and the timing loop:
oldRDD = newRDD
i += 1
}
println("Avg iteration time:" + avgTime / numIterations)
Best regards, Alexander
From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Friday, July 10, 2015 10:04 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Model parallelism with RDD
Yeah I can see that being the case -- caching implies creating objects that
will be stored in memory. So there is a trade-off between storing data in
memory but having to garbage collect it later vs. recomputing the data.
Shivaram
On Fri, Jul 10, 2015 at 9:49 PM, Ulanov, Alexander
wrote:
Hi Shivaram,
Thank you for the suggestion! If I do .cache and .count, each iteration takes much more time, and the extra time is spent in GC. Is that normal?
On July 10, 2015, at 21:23, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
I think you need to do `newRDD.cache()` and `newRDD.count` before you do
oldRDD.unpersist(true) -- Otherwise it might be recomputing all the
previous iterations each time.
Thanks
Shivaram
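In code form, the ordering being suggested is roughly the following (a sketch; oldRDD, newRDD, and the map(_ + 1.0) update are placeholders):

val newRDD = oldRDD.map(_ + 1.0)  // placeholder update step
newRDD.cache()                    // mark the new state for storage
newRDD.count()                    // force it to be computed and cached now
oldRDD.unpersist(true)            // only then drop the old state, so later
                                  // actions never have to rebuild the old lineage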
On Fri, Jul 10, 2015 at 7:44 PM, Ulanov, Alexander
wrote:
Hi,
I am interested in how scalable model parallelism within Spark can be. Suppose the model contains N weights of type Double, and N is so large that the weights do not fit into the memory of a single node. So, we can store the model in an RDD[Double] spread across several nodes. To train the model, one needs to update all of the weights on each iteration.
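Putting the pattern from this thread together, a minimal sketch of such a benchmark; the model size, the map(_ + 1.0) update, and the mean() convergence stand-in are placeholders, not the original code:

import org.apache.spark.{SparkConf, SparkContext}

object ModelParallelismSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("model-parallelism-sketch"))
    val numIterations = 10
    val n = 10000000                                       // N weights spread over the cluster

    var oldRDD = sc.parallelize(0 until n).map(_ => 0.0)   // RDD[Double] holding the model
    oldRDD.cache()
    oldRDD.count()

    var avgTime = 0.0
    var i = 0
    while (i < numIterations) {
      val t = System.nanoTime()
      val newRDD = oldRDD.map(_ + 1.0)                     // placeholder weight update
      newRDD.cache()
      newRDD.count()                                       // materialize before dropping the old state
      oldRDD.unpersist(true)
      newRDD.mean()                                        // stand-in for a convergence check
      avgTime += (System.nanoTime() - t) / 1e9
      oldRDD = newRDD
      i += 1
    }
    println("Avg iteration time:" + avgTime / numIterations)
    sc.stop()
  }
}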