Much of this sounds related to the memory issue mentioned earlier in this thread. Are you using a build that has fixed that? That would be by far most important here.
If the raw memory requirement is 8GB, the actual heap size necessary could be a lot larger -- object overhead, all the other stuff in memory, overheads within the heap allocation, etc. So I would expect total memory requirement to be significantly more than 9GB. Still, this is the *total* requirement across the cluster. Each worker is just loading part of the matrix. If you have 10 workers I would imagine it roughly chops the per-worker memory requirement by 10x. This in turn depends on also letting workers use more than their default amount of memory. May need to increase executor memory here. Separately, I have observed issues with too many files open and lots of /tmp files. You may have to use ulimit to increase the number of open files allowed. On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das <debasish.da...@gmail.com>wrote: > Hi, > > For our usecases we are looking into 20 x 1M matrices which comes in the > similar ranges as outlined by the paper over here: > > http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html > > Is the exponential runtime growth in spark ALS as outlined by the blog > still exists in recommendation.ALS ? > > I am running a spark cluster of 10 nodes with total memory of around 1 TB > with 80 cores.... > > With rank = 50, the memory requirements for ALS should be 20Mx50 doubles on > every worker which is around 8 GB.... > > Even if both the factor matrices are cached in memory I should be bounded > by ~ 9 GB but even with 32 GB per worker I see GC errors... > > I am debugging the scalability and memory requirements of the algorithm > further but any insights will be very helpful... > > Also there are two other issues: > > 1. If GC errors are hit, that worker JVM goes down and I have to restart it > manually. Is this expected ? > > 2. When I try to make use of all 80 cores on the cluster I get some issues > related to java.io.File not found exception on /tmp/ ? Is there some OS > limit that how many cores can simultaneously access /tmp from a process ? > > Thanks. > Deb > >