How do you partition by product in Python? The only API is partitionBy(50).

On Jun 18, 2015, at 8:42 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> Also in my experiments, it's much faster to do blocked BLAS through cartesian
> rather than doing sc.union. Here are the details on the experiments:
>
> https://issues.apache.org/jira/browse/SPARK-4823
>
> On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> Also not sure how threading helps here because Spark puts a partition on each
> core. There may be multiple threads on each core if you are using Intel
> hyperthreading, but I will let Spark handle the threading.
>
> On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS dgemm
> based calculation.
>
> On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat
> <ayman.fara...@yahoo.com.invalid> wrote:
> Thanks Sabarish and Nick
> Would you happen to have some code snippets that you can share?
> Best
> Ayman
>
> On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan
> <sabarish.sasidha...@manthan.com> wrote:
>
>> Nick is right. I too have implemented it this way and it works just fine. In my
>> case, there can be even more products. You simply broadcast blocks of
>> products to userFeatures.mapPartitions() and BLAS multiply in there to get
>> recommendations. In my case 10K products form one block. Note that you would
>> then have to union your recommendations. And if there are lots of product
>> blocks, you might also want to checkpoint once every few times.
>>
>> Regards
>> Sab
>>
>> On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>> One issue is that you broadcast the product vectors and then do a dot
>> product one-by-one with the user vector.
>>
>> You should try forming a matrix of the item vectors and doing the dot
>> product as a matrix-vector multiply, which will make things a lot faster.
>>
>> Another optimisation that is available in 1.4 is a recommendProducts
>> method that blockifies the factors to make use of level 3 BLAS (i.e.
>> matrix-matrix multiply). I am not sure if this is available in the Python
>> API yet.
>>
>> But you can do a version yourself by using mapPartitions over user factors,
>> blocking the factors into sub-matrices and doing a matrix multiply with the item
>> factor matrix to get scores on a block-by-block basis.
>>
>> Also, as Ilya says, more parallelism can help. I don't think it's so necessary
>> to do LSH with 30,000 items.
>>
>> On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya <ilya.gane...@capitalone.com>
>> wrote:
>>
>> I actually talk about this exact thing in a blog post here:
>> http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
>> Keep in mind, you're actually doing a ton of math. Even with proper caching
>> and use of broadcast variables this will take a while, depending on the size
>> of your cluster. To get real results you may want to look into locality
>> sensitive hashing to limit your search space, and definitely look into
>> spinning up multiple threads to process your product features in parallel to
>> increase resource utilization on the cluster.
>>
>> Thank you,
>> Ilya Ganelin
>>
>> -----Original Message-----
>> From: afarahat [ayman.fara...@yahoo.com]
>> Sent: Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
>> To: user@spark.apache.org
>> Subject: Matrix Multiplication and mllib.recommendation
>>
>> Hello;
>> I am trying to get predictions after running the ALS model.
>> The model works fine. In the prediction/recommendation, I have about
>> 30,000 products and 90 million users.
>> When I try predictAll, it fails.
>> I have been trying to formulate the problem as a matrix multiplication where
>> I first get the product features, broadcast them and then do a dot product.
>> It's still very slow. Any reason why?
>> Here is some sample code:
>>
>> import numpy
>> from pyspark.mllib.recommendation import MatrixFactorizationModel
>>
>> def doMultiply(x):
>>     # dot the user vector x with each broadcast product vector, one at a time
>>     a = []
>>     mylen = len(pf.value)
>>     for i in range(mylen):
>>         myprod = numpy.dot(x, pf.value[i][1])
>>         a.append(myprod)
>>     return a
>>
>>
>> # sc is the existing SparkContext
>> myModel = MatrixFactorizationModel.load(sc, "FlurryModelPath")
>> # I need to select which products to broadcast, but let's try all
>> m1 = myModel.productFeatures().sample(False, 0.001)
>> pf = sc.broadcast(m1.collect())
>> uf = myModel.userFeatures()
>> f1 = uf.map(lambda x: (x[0], doMultiply(x[1])))
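
A rough sketch of the blocked approach Nick and Sabarish describe in this thread, assuming NumPy is installed on the workers: stack the product vectors into one matrix, then score users a block at a time so that each block is a single matrix-matrix multiply instead of many separate numpy.dot calls. The names recommend_partition, top_k and block_size are hypothetical and not part of MLlib, and the Spark wiring in the trailing comment is an untested assumption.

```python
import numpy as np


def score_block(ids, vectors, product_ids, product_matrix, top_k):
    """Score one block of users against all products with a single multiply."""
    # (block x rank) . (rank x n_products) -> (block x n_products)
    scores = np.asarray(vectors, dtype=np.float64).dot(product_matrix.T)
    # Column indices of the top_k highest scores in each row, best first.
    best = np.argsort(-scores, axis=1)[:, :top_k]
    return [
        (uid, [(product_ids[j], float(scores[i, j])) for j in best[i]])
        for i, uid in enumerate(ids)
    ]


def recommend_partition(user_rows, product_ids, product_matrix,
                        top_k=10, block_size=1024):
    """Yield (user_id, [(product_id, score), ...]) for one partition.

    user_rows      -- iterable of (user_id, feature_vector) pairs
    product_ids    -- product ids aligned with the rows of product_matrix
    product_matrix -- array of shape (n_products, rank)
    block_size     -- hypothetical tuning knob: users per matrix multiply
    """
    product_ids = np.asarray(product_ids)
    ids, vectors = [], []
    for user_id, features in user_rows:
        ids.append(user_id)
        vectors.append(features)
        if len(ids) == block_size:
            for rec in score_block(ids, vectors, product_ids,
                                   product_matrix, top_k):
                yield rec
            ids, vectors = [], []
    if ids:  # last, possibly smaller, block
        for rec in score_block(ids, vectors, product_ids,
                               product_matrix, top_k):
            yield rec


# Assumed Spark wiring (not run here), reusing names from the original post:
#   products = myModel.productFeatures().collect()
#   pf = sc.broadcast(([p[0] for p in products],
#                      np.array([p[1] for p in products])))
#   recs = myModel.userFeatures().mapPartitions(
#       lambda rows: recommend_partition(rows, pf.value[0], pf.value[1]))
```

Keeping only the top_k scores per user matters at this scale: 90 million users times 30,000 products is far too many scores to materialise. The best block_size depends on worker memory and would need to be found by experiment.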