Hi Eilidh,

Because you are multiplying with the transpose, you don't necessarily have to build the right-hand side of the matrix at all. You can broadcast blocks of the indexed row matrix to itself and achieve the multiplication that way; a rough sketch follows below.
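A minimal sketch of that broadcast idea, assuming the rows live in an RDD[(Long, Array[Double])] of (rowIndex, values) pairs; the data, names and block size below are placeholders, not tested code:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of computing A * A^T without materialising the transpose:
// (A * A^T)(i, j) = dot(row_i, row_j), so we broadcast one block of
// rows at a time and dot it against the distributed rows. Symmetry
// means only pairs with j >= i are needed.
object GramByBroadcast {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.iterator.zip(b.iterator).map { case (x, y) => x * y }.sum

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("gram-by-broadcast"))
    val rows = sc.parallelize(Seq(          // toy data; replace with real rows
      0L -> Array(1.0, 2.0, 3.0),
      1L -> Array(4.0, 5.0, 6.0),
      2L -> Array(7.0, 8.0, 9.0))).cache()

    val blockSize = 2                       // rows per broadcast block (tune)
    val blocks = rows.keys.collect().sorted.grouped(blockSize).toSeq

    val upperTriangle = blocks.flatMap { ids =>
      val idSet = ids.toSet
      // Pull one block of rows to the driver and broadcast it to all workers.
      val block = sc.broadcast(
        rows.filter { case (i, _) => idSet(i) }.collect().toMap)
      // Each worker dots its local rows against the broadcast block.
      rows.flatMap { case (i, v) =>
        block.value.collect { case (j, w) if j >= i => ((i, j), dot(v, w)) }
      }.collect()
    }
    upperTriangle.foreach { case ((i, j), x) => println(s"($i,$j) = $x") }
    sc.stop()
  }
}

Because the result is symmetric, keeping only the pairs with j >= i does half the work, which is exactly why the right-hand side never has to be built.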
But for similarity computation you might want to use an approach like locality-sensitive hashing first, to identify a narrowed-down set of similar customers, and then apply cosine similarity on only that shortlist. That would scale much better than full matrix multiplication (a rough sketch of this two-stage approach is appended at the end of this thread). You could try the following options for this:

https://github.com/soundcloud/cosine-lsh-join-spark
http://spark-packages.org/package/tdebatty/spark-knn-graphs
https://github.com/marufaytekin/lsh-spark

Regards
Sab

Hi Sab,

Thanks for your response. We're thinking of trying a bigger cluster, because we just started with 2 nodes. What we really want to know is whether the code will scale up with larger matrices and more nodes. I'd be interested to hear how large a matrix multiplication you managed to do. Is there an alternative you'd recommend for calculating similarity over a large dataset?

Thanks,
Eilidh

On 13 Nov 2015, at 09:55, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:

We have done this by blocking, but without using BlockMatrix. We used our own blocking mechanism, because BlockMatrix didn't exist in Spark 1.2. What is the size of your block? How much memory are you giving to the executors? I assume you are running on YARN; if so, you will want to make sure your YARN executor memory overhead is set to a higher value than the default (a BlockMatrix sketch with this setting is appended at the end of this thread).

Just curious, could you also explain why you need matrix multiplication with the transpose? Smells like similarity computation.

Regards
Sab

On Thu, Nov 12, 2015 at 7:27 PM, Eilidh Troup <e.tr...@epcc.ed.ac.uk> wrote:

> Hi,
>
> I'm trying to multiply a large squarish matrix with its transpose.
> Eventually I'd like to work with matrices of size 200,000 by 500,000, but
> I've started off first with 100 by 100, which was fine, and then with
> 10,000 by 10,000, which failed with an out-of-memory exception.
>
> I used MLlib and BlockMatrix and tried various block sizes, and also
> tried switching disk serialisation on.
>
> We are running on a small cluster, using a CSV file in HDFS as the input
> data.
>
> Would anyone with experience of multiplying large, dense matrices in
> Spark be able to comment on what to try to make this work?
>
> Thanks,
> Eilidh
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.

--
Architect - Big Data
Ph: +91 99805 99458
Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan India ICT)*
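For reference, a minimal sketch of the BlockMatrix approach discussed in this thread, assuming the CSV in HDFS holds i,j,value triples; the path, block size and overhead value are placeholders. It also applies Sab's YARN advice by raising spark.yarn.executor.memoryOverhead above the default:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Sketch of multiplying a BlockMatrix with its transpose in MLlib.
object BlockMatrixGram {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("blockmatrix-gram")
      // Raise the off-heap overhead above the default, per the advice
      // above (value in MB; tune for your executors).
      .set("spark.yarn.executor.memoryOverhead", "2048")
    val sc = new SparkContext(conf)

    // Assumed input layout: one "i,j,value" triple per line.
    val entries = sc.textFile("hdfs:///path/to/matrix.csv").map { line =>
      val Array(i, j, v) = line.split(',')
      MatrixEntry(i.toLong, j.toLong, v.toDouble)
    }

    // Smaller blocks mean more, cheaper tasks; 1024x1024 is a starting point.
    val a = new CoordinateMatrix(entries).toBlockMatrix(1024, 1024).cache()
    val gram = a.multiply(a.transpose)   // A * A^T, computed block by block

    println(s"result: ${gram.numRows()} x ${gram.numCols()}")
    sc.stop()
  }
}

Smaller blocks trade more tasks for lower per-task memory, which is usually the first knob to turn when a 10,000 by 10,000 multiply runs out of memory.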
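And a rough sketch of the two-stage approach recommended above: bucket customers with sign-random-projection LSH, then compute exact cosine similarity only within buckets. This is illustrative only, not the API of any of the linked packages; the dimensionality, number of hyperplanes and data are toy values:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object LshThenCosine {
  val dim = 3            // feature dimension (placeholder)
  val numPlanes = 8      // hash bits; more planes => smaller buckets

  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot  = a.iterator.zip(b.iterator).map { case (x, y) => x * y }.sum
    val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
    dot / (norm(a) * norm(b))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lsh-then-cosine"))
    val rnd = new Random(42)
    // Random hyperplanes, shared with every task via closure capture.
    val planes = Array.fill(numPlanes, dim)(rnd.nextGaussian())

    val customers = sc.parallelize(Seq(    // toy vectors; replace with real data
      "u1" -> Array(1.0, 0.0, 2.0),
      "u2" -> Array(0.9, 0.1, 2.1),
      "u3" -> Array(-1.0, 5.0, 0.0)))

    // Stage 1: hash each vector to a bit signature; vectors with equal
    // signatures land in the same bucket and are likely to be similar.
    val buckets = customers.map { case (id, v) =>
      val sig = planes.map { p =>
        val dot = p.iterator.zip(v.iterator).map { case (x, y) => x * y }.sum
        if (dot >= 0) '1' else '0'
      }.mkString
      sig -> (id, v)
    }.groupByKey()

    // Stage 2: exact cosine similarity, but only on the narrowed-down pairs.
    val sims = buckets.flatMap { case (_, members) =>
      val m = members.toArray
      for (i <- m.indices; j <- i + 1 until m.length)
        yield (m(i)._1, m(j)._1, cosine(m(i)._2, m(j)._2))
    }
    sims.collect().foreach(println)
    sc.stop()
  }
}

More hyperplanes make the buckets smaller and the candidate list shorter, at the cost of missing some genuinely similar pairs; the usual remedy is to repeat the hashing with several independent hash tables, which is what dedicated LSH packages handle for you.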