Latent Dirichlet Allocation in Spark

2017-02-16 Thread Manish Tripathi
Hi, I am trying to do topic modeling in Spark using Spark's LDA package. I am using Spark 2.0.2 and the PySpark API. I ran the code below: *from pyspark.ml.clustering import LDA* *lda = LDA(featuresCol="tf_features", k=10, seed=1, optimizer="online")* *ldaModel = lda.fit(tf_df)* *lda_df = ldaModel.transfor

Cosine Similarity Implementation in Spark

2017-01-30 Thread Manish Tripathi
I have a data frame with two columns (id, vector (tf-idf)). The first column is the ID of the document, while the second column holds a Vector of tf-idf values. I want to use DIMSUM for cosine similarity, but unfortunately I have Spark 1.x and it looks like these methods are implemented only in

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
't think it will be back-ported because the behavior was intended > in 1.x, just wrongly documented, and we don't want to change the behavior > in 1.x. The results are still correctly ordered anyway. > > On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi > wrote: >

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
pose you invest in improving the docs rather than saying 'this isn't > what I expected'. > > (No, our book isn't a reference for MLlib, more like worked examples) > > On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi > wrote: > >> I used a word2vec algorithm

Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
I used Spark's word2vec algorithm to compute document vectors for a text. I then used the findSynonyms function of the model object to get synonyms of a few words. I see something like this: I do not understand why the cosine similarity is being calculated as more than 1. Cosine similarity s
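The replies in this thread note that the reported values are still correctly ordered even though they can exceed 1. The sketch below illustrates, in plain Python, how a similarity score can behave that way in general. It assumes (this is an assumption for illustration, not something the snippets confirm about Spark) that the score is divided by only one of the two vector norms. The helper names `cosine_similarity` and `half_normalized` and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by both norms; always lies in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def half_normalized(query, v):
    """Hypothetical score: dot product divided by the candidate's norm only."""
    dot = sum(a * b for a, b in zip(query, v))
    return dot / math.sqrt(sum(b * b for b in v))


# Made-up vectors, chosen so the arithmetic is easy to follow.
query = [3.0, 4.0]
candidates = {"a": [6.0, 8.0], "b": [4.0, 3.0], "c": [-3.0, 4.0]}

# Rank candidates by each score, highest first.
full_rank = sorted(
    candidates, key=lambda k: cosine_similarity(query, candidates[k]), reverse=True
)
half_rank = sorted(
    candidates, key=lambda k: half_normalized(query, candidates[k]), reverse=True
)

for name, vec in candidates.items():
    print(name, cosine_similarity(query, vec), half_normalized(query, vec))
```

For a fixed query, the two scores differ only by a constant positive factor (the query's norm), so the rankings agree even though the half-normalized values are not bounded by 1.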

Re: Negative values of predictions in ALS.tranform

2016-12-16 Thread Manish Tripathi
Thanks a bunch. That's very helpful. On Friday, December 16, 2016, Sean Owen wrote: > That all looks correct. > > On Thu, Dec 15, 2016 at 11:54 PM Manish Tripathi > wrote: > >> ok. Thanks. So here is what I understood. >> >> Input data to Als.fit(impli

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
On Thu, Dec 15, 2016 at 3:46 PM, Sean Owen wrote: > No, input are weights or strengths. The output is a factorization of the > binarization of that to 0/1, not probs or a factorization of the input. > This explains the range of the output. > > > On Thu, Dec 15, 2016, 23:43

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
*is* > factoring the 0/1 matrix. > > On Thu, Dec 15, 2016, 23:31 Manish Tripathi wrote: > >> Ok. So we can kind of interpret the output as probabilities even though >> it is not modeling probabilities. This is to be able to use it for >> binaryclassification evaluator

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
values will be in [0,1], but, it's possible to get > values outside that range. > > On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi > wrote: > >> Hi >> >> ran the ALS model for implicit feedback thing. Then I used the .transform >> method of the mo

Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
Hi, I ran the ALS model for implicit feedback. Then I used the .transform method of the model to predict the ratings for the original dataset. My dataset is of the form (user, item, rating). I see something like below: predictions.show(5,truncate=False) Why is the last prediction value negativ

Spark Float to VectorUDT for ML evaluator lib

2016-11-04 Thread Manish Tripathi
Hi, I am trying to run the ML Binary Evaluation Classifier metrics to compare the rating with predicted values and get the AreaROC. My dataframe has two columns: rating, which is an int (I have binarized it), and predictions, which is a float. When I pass it to the ML evaluator method I get an error as