Hi Eilidh,

Because you are multiplying with the transpose, you don't necessarily have
to build the right-hand side of the product. You can broadcast blocks of
the indexed row matrix against itself and achieve the multiplication that
way; the sketch below shows the idea.
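
Roughly, in Scala (a minimal sketch of the idea; `mat`, the block size and
the dot helper are mine for illustration, not MLlib API):

    // dot product of two dense rows
    def dot(a: Array[Double], b: Array[Double]): Double = {
      var s = 0.0; var k = 0
      while (k < a.length) { s += a(k) * b(k); k += 1 }
      s
    }

    // mat is an IndexedRowMatrix; key each row of A by its index
    val rows = mat.rows.map(r => (r.index, r.vector.toArray)).cache()
    val n = mat.numRows()
    val blockSize = 1000L

    val gram = (0L until n by blockSize).map { start =>
      // collect one block of rows and ship it to every executor
      val block = rows.filter { case (i, _) =>
        i >= start && i < start + blockSize
      }.collect()
      val bc = sc.broadcast(block)
      // each ((i, j), v) is one entry of A * A^T; no transpose is built
      rows.flatMap { case (j, vj) =>
        bc.value.map { case (i, vi) => ((i, j), dot(vi, vj)) }
      }
    }.reduce(_ union _)

And since A * A^T is symmetric, you only need the entries with i <= j,
which halves the work.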

But for similarity computation you might want to use an approach like
locality-sensitive hashing first, to identify candidate groups of similar
customers, and then apply cosine similarity only to that narrowed-down
list. That would scale much better than a full matrix multiplication. You
could try one of the following packages (a rough sketch of the cosine step
follows the links):

https://github.com/soundcloud/cosine-lsh-join-spark
http://spark-packages.org/package/tdebatty/spark-knn-graphs
https://github.com/marufaytekin/lsh-spark
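
Once LSH has produced candidate pairs, the cosine step itself is cheap.
A rough sketch (candidatePairs and vectors here are assumed inputs, not
part of any of those packages' APIs):

    def cosine(a: Array[Double], b: Array[Double]): Double = {
      var d = 0.0; var na = 0.0; var nb = 0.0; var k = 0
      while (k < a.length) {
        d += a(k) * b(k); na += a(k) * a(k); nb += b(k) * b(k); k += 1
      }
      d / (math.sqrt(na) * math.sqrt(nb))
    }

    // candidatePairs: RDD[(Long, Long)] of customer ids picked by LSH
    // vectors: RDD[(Long, Array[Double])] of customer feature rows
    val sims = candidatePairs
      .join(vectors)                                // (i, (j, vi))
      .map { case (i, (j, vi)) => (j, (i, vi)) }
      .join(vectors)                                // (j, ((i, vi), vj))
      .map { case (j, ((i, vi), vj)) => ((i, j), cosine(vi, vj)) }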

Regards
Sab

Hi Sab,

Thanks for your response. We're thinking of trying a bigger cluster,
because we've only just started with 2 nodes. What we really want to know
is whether the code will scale up with larger matrices and more nodes. How
large a matrix multiplication did you manage to do?

Is there an alternative you’d recommend for calculating similarity over a
large dataset?

Thanks,
Eilidh

On 13 Nov 2015, at 09:55, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

We have done this by blocking, but without using BlockMatrix; we used our
own blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What
is the size of your blocks? How much memory are you giving to the
executors? I assume you are running on YARN; if so, you will want to make
sure spark.yarn.executor.memoryOverhead is set to a higher value than the
default.
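
For example (the numbers are only illustrative and need tuning for your
cluster; on YARN the overhead defaults to 10% of executor memory with a
384 MB floor, and dense matrix blocks can easily blow past that):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")
      // off-heap overhead per executor, in MB;
      // default is max(384, 0.10 * executor memory)
      .set("spark.yarn.executor.memoryOverhead", "2048")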

Just curious, could you also explain why you need the multiplication with
the transpose? It smells like a similarity computation.

Regards
Sab

On Thu, Nov 12, 2015 at 7:27 PM, Eilidh Troup <e.tr...@epcc.ed.ac.uk> wrote:

> Hi,
>
> I'm trying to multiply a large squarish matrix with its transpose.
> Eventually I'd like to work with matrices of size 200,000 by 500,000, but
> I started off with 100 by 100, which was fine, and then with 10,000 by
> 10,000, which failed with an out-of-memory exception.
>
> I used MLlib and BlockMatrix and tried various block sizes, and also tried
> switching disk serialisation on.
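>
> For concreteness, a minimal version of that multiply looks something like
> this (the path, CSV layout and block sizes are placeholders, not my exact
> code):
>
>     import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
>
>     // parse "i,j,value" triples from the CSV in HDFS
>     val entries = sc.textFile("hdfs:///path/to/matrix.csv").map { line =>
>       val Array(i, j, v) = line.split(",")
>       MatrixEntry(i.toLong, j.toLong, v.toDouble)
>     }
>     val a = new CoordinateMatrix(entries).toBlockMatrix(1024, 1024).cache()
>     val product = a.multiply(a.transpose)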
>
> We are running on a small cluster, using a CSV file in HDFS as the input
> data.
>
> Would anyone with experience of multiplying large, dense matrices in Spark
> be able to comment on what to try to make this work?
>
> Thanks,
> Eilidh


-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*


