Hi Reza,

I see that ((Int, Int), Double) pairs are generated for every column pair that meets the criteria controlled by the threshold. Assuming a simple 1 x 10K matrix, that means I would need at least 12GB of memory per executor for the flatMap, just to hold these pairs and excluding any other overhead. Is that correct? How can we make this scale to even larger n (when m stays small), e.g. 100 x 5 million? One option is to use a higher threshold. Another is for me to start from a SparseVector. Are there any other optimizations I can take advantage of?
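For reference, here is the back-of-the-envelope arithmetic behind my 12GB figure (the bytes-per-pair number is my rough assumption about JVM overhead for a boxed ((Int, Int), Double) tuple, not a measured value):

```python
# Rough estimate of memory for all ((i, j), sim) pairs when the
# threshold prunes nothing. bytes_per_pair is an assumed figure for
# a boxed ((Int, Int), Double) tuple on the JVM, not a measurement.
def pair_memory_gb(n_cols, bytes_per_pair=250):
    pairs = n_cols * (n_cols - 1) // 2  # upper-triangular column pairs
    return pairs * bytes_per_pair / 1e9

print(pair_memory_gb(10_000))     # ~12.5 GB for n = 10K
print(pair_memory_gb(5_000_000))  # millions of GB for n = 5M
```

So without aggressive pruning, the pair count alone grows quadratically in n, which is why the 5-million-column case looks infeasible to me as is.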
Thanks,
Sab