Hi Manish,

With 56431 columns, the output can be as large as 56431 x 56431 ~= 3 billion entries. When even a single row is dense, that can overwhelm a machine. You can push that limit up with more RAM, but note that DIMSUM is meant for tall and skinny matrices: it scales linearly with the number of rows and distributes across the cluster, but still scales quadratically with the number of columns. I will update the documentation to make this clear.

Best,
Reza
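[Editor's note: for readers following along, here is a minimal, self-contained sketch of the call being discussed. The toy vectors, object name, and threshold value are illustrative only and not taken from Manish's job.]

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object DimsumSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DimsumSketch").setMaster("local[*]"))

    // One row per attribute, one column per entity. DIMSUM assumes this
    // tall-and-skinny shape: many rows, comparatively few columns.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 0.5, 0.0),
      Vectors.dense(0.0, 1.0, 0.0, 0.5),
      Vectors.dense(0.5, 0.5, 1.0, 0.0)
    ))
    val mat = new RowMatrix(rows)

    // Exact all-pairs cosine similarities (brute force, no sampling).
    val exact = mat.columnSimilarities()

    // Sampling (DIMSUM) variant: pairs whose similarity is well below the
    // threshold may be missed, but the shuffle size drops sharply.
    val approx = mat.columnSimilarities(0.5)

    // Both return an upper-triangular CoordinateMatrix with up to
    // numCols * (numCols - 1) / 2 entries -- still quadratic in columns.
    println(s"exact pairs: ${exact.entries.count()}, approx pairs: ${approx.entries.count()}")

    sc.stop()
  }
}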
On Thu, Mar 19, 2015 at 3:46 AM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
> Hi Reza,
>
> *Behavior*:
>
> · I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522
> <http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0>
> with all workers running at 100% CPU. There is hardly any shuffle read/write happening. And after some time, “ERROR YarnClientClusterScheduler: Lost executor” starts showing (maybe because of the nodes running at 100% CPU).
>
> · For thresholds of 200+ (tried up to 1000) it gave an error (here xxxxxxxxxxxxxxxx was different for different thresholds):
>
> Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Oversampling should be greater than 1: 0.xxxxxxxxxxxxxxxxxxxx
>     at scala.Predef$.require(Predef.scala:233)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)
>     at EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)
>     at EntitySimilarity$.main(EntitySimilarity.scala:80)
>     at EntitySimilarity.main(EntitySimilarity.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> · If I get rid of frequently occurring attributes and keep only those attributes which occur in at most 2% of entities, then the job doesn't get stuck or fail.
>
> *Data & environment*:
>
> · RowMatrix of size 43345 x 56431
>
> · In the matrix there are a couple of rows whose value is the same in up to 50% of the columns (frequently occurring attributes).
>
> · I am running this on one of our Dev clusters running CDH 5.3.0, with 5 data nodes (each 4-core and 16GB RAM).
>
> My question – do you think this is a hardware sizing issue and we should test it on larger machines?
>
> Regards,
> Manish
>
> *From:* Manish Gupta 8 [mailto:mgupt...@sapient.com]
> *Sent:* Wednesday, March 18, 2015 11:20 PM
> *To:* Reza Zadeh
> *Cc:* user@spark.apache.org
> *Subject:* RE: Column Similarity using DIMSUM
>
> Hi Reza,
>
> I have only tried thresholds in the range of 0 to 1. I was not aware that the threshold can be set above 1.
> Will try and update.
>
> Thank You
>
> - Manish
>
> *From:* Reza Zadeh [mailto:r...@databricks.com <r...@databricks.com>]
> *Sent:* Wednesday, March 18, 2015 10:55 PM
> *To:* Manish Gupta 8
> *Cc:* user@spark.apache.org
> *Subject:* Re: Column Similarity using DIMSUM
>
> Hi Manish,
>
> Did you try calling columnSimilarities(threshold) with different threshold values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
> Best,
> Reza
>
> On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
>
> Hi,
>
> I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on a dataset that looks like (Entity, Attribute, Value), after transforming it to a row-oriented dense matrix format (one line per Attribute, one column per Entity, each cell holding a normalized value between 0 and 1).
>
> It runs extremely fast in computing similarities between Entities in most cases, but if there is even a single attribute that occurs frequently across the entities (say in 30% of entities), the job falls apart: the whole job gets stuck and the worker nodes start running at 100% CPU without making any progress on the job stage. If the dataset is very small (in the range of 1000 Entities x 500 attributes, some frequently occurring), the job finishes but takes too long (sometimes it gives GC errors too).
>
> If none of the attributes occurs frequently (all < 2%), then the job runs lightning fast (even for 1000000 Entities x 10000 attributes) and the results are very accurate.
>
> I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 cores and 16GB of RAM.
>
> My question is - *Is this behavior expected for datasets where some Attributes frequently occur*?
>
> Thanks,
> Manish Gupta
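[Editor's note: a rough, spark-shell-style sketch of the workaround Manish describes above, i.e. dropping attribute rows that are non-zero in more than about 2% of the entity columns before building the RowMatrix. The helper name and the 2% cutoff are illustrative, and an existing RDD[Vector] of attribute rows is assumed.]

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Keep only attribute rows whose density (fraction of non-zero entity
// columns) is at or below maxDensity, then build the matrix from the rest.
def dropFrequentAttributes(rows: RDD[Vector], maxDensity: Double = 0.02): RowMatrix = {
  val sparseRows = rows.filter { v =>
    val nonZeros = v.toArray.count(_ != 0.0)
    nonZeros.toDouble / v.size <= maxDensity
  }
  new RowMatrix(sparseRows)
}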