Hi Reza,
With 40 nodes and shuffle space managed by YARN over the HDFS usercache, we
could run the similarity job without doing any thresholding... We used
hash-based shuffle; sort-based shuffle will hopefully improve it
further... Note that this job was almost 6M x 1.5M.
We will go towards 50M rows x ~3M columns.
Is the threshold valid only for tall, skinny matrices? Mine is 6M x 1.5M,
and I used a sparsity pattern of 100 nonzeros per 1.5M columns... we would
like to increase the sparsity pattern to 1000:1.5M.
I am running 1.1 stable and I get random shuffle failures... maybe the 1.2
sort-based shuffle will help...
I read in Reza's paper that ov
The complexity of DIMSUM is independent of the number of rows but still
has a quadratic dependency on the number of columns. 1.5M columns may be
too large for DIMSUM. Try increasing the threshold and see whether it
helps. -Xiangrui
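For anyone following the thread, here is a minimal pure-Python sketch of the sampling idea behind the threshold, based on my reading of the DIMSUM paper. This is toy code, not MLlib's implementation; the function name, the sparse-row representation, and the direct `gamma` parameter (MLlib derives its oversampling factor from the threshold you pass) are all my own assumptions:

```python
import math
import random

def dimsum_similarities(rows, gamma, rng=None):
    """Estimate pairwise column cosine similarities with DIMSUM-style
    sampling. rows: list of sparse rows as {col_index: value} dicts.
    gamma: oversampling parameter; larger gamma -> more accurate
    estimates but more pairs emitted into the shuffle."""
    rng = rng or random.Random(0)

    # Column norms, needed for the sampling probabilities.
    sq = {}
    for row in rows:
        for j, v in row.items():
            sq[j] = sq.get(j, 0.0) + v * v
    norms = {j: math.sqrt(s) for j, s in sq.items()}

    sqrt_g = math.sqrt(gamma)
    sims = {}
    for row in rows:
        # Keep each nonzero entry with prob min(1, sqrt(gamma)/||c_j||):
        # heavy columns get down-sampled, light columns are always kept.
        kept = {j: v for j, v in row.items()
                if rng.random() < min(1.0, sqrt_g / norms[j])}
        cols = sorted(kept)
        for a in range(len(cols)):
            for b in range(a + 1, len(cols)):
                i, j = cols[a], cols[b]
                # Scaling that makes the estimator unbiased for
                # cosine similarity of columns i and j.
                contrib = (kept[i] * kept[j] /
                           (min(sqrt_g, norms[i]) * min(sqrt_g, norms[j])))
                sims[(i, j)] = sims.get((i, j), 0.0) + contrib
    return sims

# Tiny example: with gamma large enough, every entry survives the coin
# flip and the estimate equals the exact cosine similarity.
sims = dimsum_similarities([{0: 1.0, 1: 1.0}, {0: 1.0, 2: 1.0}], gamma=100.0)
# sims[(0, 1)] is exactly 1/sqrt(2) here, since sqrt(gamma) exceeds all
# column norms and no sampling actually occurs.
```

The point relevant to this thread: the shuffle volume is governed by gamma, not by the number of rows, which is why raising the threshold (lowering the effective oversampling) is the lever Xiangrui suggests.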
On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das wrote:
Hi,
I am running brute-force similarity from RowMatrix on a job with a 5M x 1.5M
sparse matrix with 800M entries. With 200M entries the job runs fine, but
with 800M I am getting exceptions like "too many open files" and "no space
left on device"...
Seems like I need more nodes, or should I use DIMSUM sampling?
I
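A back-of-envelope calculation suggests why 800M entries fails where 200M ran: brute-force column similarity emits one partial product into the shuffle for every pair of nonzero entries in each row, so shuffle volume grows quadratically with row density. Assuming (my assumption) entries are spread evenly across the 5M rows:

```python
from math import comb

def bruteforce_pairs(num_rows, nnz_per_row):
    """Partial products emitted into the shuffle by brute-force column
    similarity: one per pair of nonzero entries in each row."""
    return num_rows * comb(nnz_per_row, 2)

# 200M entries over 5M rows -> ~40 nonzeros/row;
# 800M entries over 5M rows -> ~160 nonzeros/row.
small = bruteforce_pairs(5_000_000, 40)   # 3.9e9 partial products
large = bruteforce_pairs(5_000_000, 160)  # 6.36e10, roughly 16x more
```

Quadrupling the entry count multiplies the shuffle by ~16x, which is consistent with the "too many open files" / "no space left on device" failures appearing only at 800M.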