Hi Manish,

With 56431 columns, the output can be as large as 56431 x 56431 ~= 3 billion entries. When even a single row is dense, that can overwhelm a machine. You can push that limit up with more RAM, but note that DIMSUM is meant for tall and skinny matrices: it scales linearly with the number of rows and distributes across the cluster, but still scales quadratically with the number of columns. I will update the documentation to make this clear.

Best,
Reza
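[Editor's note: for readers following along, here is a minimal, self-contained sketch of the call being discussed. The toy vectors, object name, and threshold value are illustrative only and not taken from Manish's job.]

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object DimsumSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DimsumSketch").setMaster("local[*]"))

    // One row per attribute, one column per entity. DIMSUM assumes this
    // tall-and-skinny shape: many rows, comparatively few columns.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 0.5, 0.0),
      Vectors.dense(0.0, 1.0, 0.0, 0.5),
      Vectors.dense(0.5, 0.5, 1.0, 0.0)
    ))
    val mat = new RowMatrix(rows)

    // Exact all-pairs cosine similarities (brute force, no sampling).
    val exact = mat.columnSimilarities()

    // Sampling (DIMSUM) variant: pairs whose similarity is well below the
    // threshold may be missed, but the shuffle size drops sharply.
    val approx = mat.columnSimilarities(0.5)

    // Both return an upper-triangular CoordinateMatrix with up to
    // numCols * (numCols - 1) / 2 entries -- still quadratic in columns.
    println(s"exact pairs: ${exact.entries.count()}, approx pairs: ${approx.entries.count()}")

    sc.stop()
  }
}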
On Thu, Mar 19, 2015 at 3:46 AM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
> Hi Reza,
>
> *Behavior*:
>
> · I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522
> <http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0>
> with all workers running at 100% CPU. There is hardly any shuffle read/write happening. And after some time, “ERROR YarnClientClusterScheduler: Lost executor” starts showing (maybe because of the nodes running at 100% CPU).
>
> · For thresholds of 200+ (tried up to 1000) it gave an error (here xxxxxxxxxxxxxxxx was different for different thresholds):
>
> Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Oversampling should be greater than 1: 0.xxxxxxxxxxxxxxxxxxxx
>     at scala.Predef$.require(Predef.scala:233)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)
>     at EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)
>     at EntitySimilarity$.main(EntitySimilarity.scala:80)
>     at EntitySimilarity.main(EntitySimilarity.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> · If I get rid of frequently occurring attributes and keep only those attributes which occur in at most 2% of entities, then the job doesn't get stuck or fail.
>
> *Data & environment*:
>
> · RowMatrix of size 43345 x 56431
>
> · In the matrix there are a couple of rows whose value is the same in up to 50% of the columns (frequently occurring attributes).
>
> · I am running this on one of our Dev clusters running CDH 5.3.0, with 5 data nodes (each 4-core and 16GB RAM).
>
> My question – do you think this is a hardware sizing issue and we should test it on larger machines?
>
> Regards,
> Manish
>
> *From:* Manish Gupta 8 [mailto:mgupt...@sapient.com]
> *Sent:* Wednesday, March 18, 2015 11:20 PM
> *To:* Reza Zadeh
> *Cc:* user@spark.apache.org
> *Subject:* RE: Column Similarity using DIMSUM
>
> Hi Reza,
>
> I have only tried thresholds in the range of 0 to 1. I was not aware that the threshold can be set above 1.
> Will try and update.
>
> Thank You
>
> - Manish
>
> *From:* Reza Zadeh [mailto:r...@databricks.com <r...@databricks.com>]
> *Sent:* Wednesday, March 18, 2015 10:55 PM
> *To:* Manish Gupta 8
> *Cc:* user@spark.apache.org
> *Subject:* Re: Column Similarity using DIMSUM
>
> Hi Manish,
>
> Did you try calling columnSimilarities(threshold) with different threshold values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
> Best,
> Reza
>
> On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
>
> Hi,
>
> I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on a dataset that looks like (Entity, Attribute, Value), after transforming it to a row-oriented dense matrix format (one line per Attribute, one column per Entity, each cell holding a normalized value between 0 and 1).
>
> It runs extremely fast in computing similarities between Entities in most cases, but if there is even a single attribute that occurs frequently across the entities (say in 30% of entities), the job falls apart: the whole job gets stuck and the worker nodes start running at 100% CPU without making any progress on the job stage. If the dataset is very small (in the range of 1000 Entities x 500 attributes, some frequently occurring), the job finishes but takes too long (sometimes it gives GC errors too).
>
> If none of the attributes occurs frequently (all < 2%), then the job runs lightning fast (even for 1000000 Entities x 10000 attributes) and the results are very accurate.
>
> I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 cores and 16GB of RAM.
>
> My question is - *Is this behavior expected for datasets where some Attributes frequently occur*?
>
> Thanks,
> Manish Gupta
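[Editor's note: a rough, spark-shell-style sketch of the workaround Manish describes above, i.e. dropping attribute rows that are non-zero in more than about 2% of the entity columns before building the RowMatrix. The helper name and the 2% cutoff are illustrative, and an existing RDD[Vector] of attribute rows is assumed.]

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Keep only attribute rows whose density (fraction of non-zero entity
// columns) is at or below maxDensity, then build the matrix from the rest.
def dropFrequentAttributes(rows: RDD[Vector], maxDensity: Double = 0.02): RowMatrix = {
  val sparseRows = rows.filter { v =>
    val nonZeros = v.toArray.count(_ != 0.0)
    nonZeros.toDouble / v.size <= maxDensity
  }
  new RowMatrix(sparseRows)
}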