Hi Matei,

I am hitting similar problems with 10 ALS iterations. I am running with 24 GB of executor memory on 10 nodes, for a 20M x 3M matrix with rank = 50.
The first iteration of flatMaps runs fine, which means that the memory requirements are good per iteration. If I checkpoint the RDDs, most likely the remaining 9 iterations will also run fine and I will get the results.

Is there a plan to add a checkpoint option to ALS for such large factorization jobs?

Thanks.
Deb

On Tue, Jan 28, 2014 at 11:10 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> That would be great to add. Right now it would be easy to change it to use
> another Hadoop FileSystem implementation at the very least (I think you can
> just pass the URL for that), but for Cassandra you'd have to use a
> different InputFormat or some direct Cassandra access API.
>
> Matei
>
> On Jan 28, 2014, at 5:02 PM, Evan Chan <e...@ooyala.com> wrote:
>
> > By the way, is there any plan to make a pluggable backend for
> > checkpointing? We might be interested in writing a, for example,
> > Cassandra backend.
> >
> > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan <junluan....@intel.com> wrote:
> >> Hi all
> >>
> >> The description about this Bug submitted by Matei is as following
> >>
> >> The tipping point seems to be around 50. We should fix this by
> >> checkpointing the RDDs every 10-20 iterations to break the lineage chain,
> >> but checkpointing currently requires HDFS installed, which not all users
> >> will have.
> >>
> >> We might also be able to fix DAGScheduler to not be recursive.
> >>
> >> regards,
> >> Andrew
> >
> > --
> > Evan Chan
> > Staff Engineer
> > e...@ooyala.com |
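
To make the "checkpoint every 10-20 iterations to break the lineage chain" idea from the quoted bug description concrete, here is a toy model in plain Python. It is not Spark code and not how ALS is implemented: `Node`, `lineage_depth`, `max_lineage_depth` and the `steps_per_iteration` parameter are all hypothetical names made up for this sketch, and the number of transformations an ALS iteration really adds is an assumption. In Spark itself the corresponding calls would be `SparkContext.setCheckpointDir` followed by `RDD.checkpoint()` on the RDD carried between iterations.

```python
class Node:
    """Toy stand-in for an RDD: it only remembers the dataset it was derived from."""

    def __init__(self, parent=None):
        self.parent = parent


def lineage_depth(node):
    """Count how many datasets are chained behind `node`, including itself."""
    depth = 0
    while node is not None:
        depth += 1
        node = node.parent
    return depth


def max_lineage_depth(num_iterations, steps_per_iteration, checkpoint_every=None):
    """Run a toy iterative job and return the longest lineage chain ever seen.

    steps_per_iteration is an assumed number of transformations that each
    iteration stacks on top of the previous result.
    """
    node = Node()  # the input dataset, with no parent
    longest = lineage_depth(node)
    for i in range(1, num_iterations + 1):
        for _ in range(steps_per_iteration):
            node = Node(parent=node)  # each transformation extends the chain
        longest = max(longest, lineage_depth(node))
        if checkpoint_every and i % checkpoint_every == 0:
            # Checkpoint: the data is saved, so the saved copy has no parents
            # and the chain starts again from depth 1.
            node = Node()
    return longest
```

With 50 iterations and 4 steps per iteration, `max_lineage_depth(50, 4)` returns 201, while `max_lineage_depth(50, 4, checkpoint_every=10)` returns 41. Anything that walks the lineage recursively then only ever sees a chain bounded by the checkpoint interval instead of one that grows with the iteration count.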