The input to LDA.train needs to be an RDD of two-element lists, where the first element is an integer id and the second is a pyspark.mllib Vector of real numbers (term counts); i.e., an RDD of [doc_id, vector_of_counts].

From your example, it looks like your corpus is instead a zero-based id paired with a tuple of the user id and the list of lines from the data that share that user_id, something like [doc_id, (user_id, [line0, line1])]. You need to turn that second element into a Vector of term counts.
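For example, something along these lines might work as a starting point (a rough, untested sketch: it assumes simple whitespace tokenization, assumes the whole vocabulary fits in driver memory, and assumes "grouped" is your RDD of (user_id, lines) pairs; to_count_vector is just a helper name I made up):

    from pyspark.mllib.linalg import Vectors

    # flatten each user's lines into one bag of words per document
    docs = grouped.map(lambda x: [w for line in x[1] for w in line.split()])

    # assign every distinct term an index into the count vector
    vocab = docs.flatMap(lambda words: words).distinct().zipWithIndex().collectAsMap()

    def to_count_vector(words):
        # build {term_index: count} and wrap it in a sparse mllib Vector
        counts = {}
        for w in words:
            counts[vocab[w]] = counts.get(vocab[w], 0) + 1.0
        return Vectors.sparse(len(vocab), counts)

    corpus = docs.zipWithIndex().map(lambda x: [x[1], to_count_vector(x[0])]).cache()
    model = LDA.train(corpus, k=10, maxIterations=10, optimizer="online")

In practice you would probably also want to lowercase, strip punctuation, and drop stop words before counting, but the key point is that the second element of each list must be a Vector of counts before it reaches LDA.train.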
On Sun, Feb 28, 2016 at 11:08 PM, Mishra, Abhishek <abhishek.mis...@xerox.com> wrote:

> Hello Bryan,
>
> Thank you for the update on the Jira. I took your code and tried it with
> mine, but I get an error when the vector is created. Please see my code
> below and advise me.
>
> My input file has contents like this:
>
> "user_id","status"
> "0026c10bbbc7eeb55a61ab696ca93923","http:////www.youtube.com//watch?v=n3nPiBai66M&feature=related **bobsnewline** tiftakar, Trudy Darmanin <3?"
> "0026c10bbbc7eeb55a61ab696ca93923","Brandon Cachia ,All I know is that,you're so nice."
> "0026c10bbbc7eeb55a61ab696ca93923","Melissa Zejtunija:HAM AND CHEESE BIEX INI??? **bobsnewline** Kirr:bit tigieg mel **bobsnewline** Melissa Zejtunija :jaq le mandix aptit tigieg **bobsnewline** Kirr:int bis serjeta?"
> "0026c10bbbc7eeb55a61ab696ca93923",".........Where is my mind?????"
>
> And this is what I am doing in my code:
>
> import string
> from pyspark.sql import SQLContext
> from pyspark import SparkConf, SparkContext
> from pyspark.mllib.clustering import LDA, LDAModel
> from nltk.tokenize import word_tokenize
> from stop_words import get_stop_words
> from nltk.stem.porter import PorterStemmer
> from gensim import corpora, models
> import gensim
> import textmining
> import pandas as pd
>
> conf = SparkConf().setAppName("building a warehouse")
> sc = SparkContext(conf=conf)
> sql_sc = SQLContext(sc)
>
> data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
> header = data.first()  # extract header
> print header
> data = data.filter(lambda x: x != header)  # filter out header
> pairs = data.map(lambda x: (x.split(',')[0], x))  # generate pair rdd of (key, value)
> #data11 = data.subtractByKey(header)
> #print pairs.collect()
> #grouped = pairs.map(lambda (x, y): (x, [y])).reduceByKey(lambda a, b: a + b)
> grouped = pairs.groupByKey()  # group values per key
> #print grouped.collectAsMap()
> grouped_val = grouped.map(lambda x: (list(x[1]))).collect()
> #rr = grouped_val.map(lambda (x, y): (x, [y]))
> #df_grouped_val = sql_sc.createDataFrame(rr, ["user_id", "status"])
> #print list(enumerate(grouped_val))
> #corpus = grouped.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
> corpus = grouped.zipWithIndex().map(lambda (term_counts, doc_id): [doc_id, term_counts]).cache()
> model = LDA.train(corpus, k=10, maxIterations=10, optimizer="online")
> #ldaModel = LDA.train(corpus, k=3)
> print corpus
> topics = model.describeTopics(3)
> print("\"topic\", \"termIndices\", \"termWeights\"")
> for i, t in enumerate(topics):
>     print("%d, %s, %s" % (i, str(t[0]), str(t[1])))
>
> sc.stop()
>
> Please help me with this.
>
> Abhishek
>
> From: Bryan Cutler [mailto:cutl...@gmail.com]
> Sent: Friday, February 26, 2016 4:17 AM
> To: Mishra, Abhishek
> Cc: user@spark.apache.org
> Subject: Re: LDA topic Modeling spark + python
>
> I'm not exactly sure how you would like to set up your LDA model, but I
> noticed there was no Python example for LDA in Spark.
> I created this issue to add one: https://issues.apache.org/jira/browse/SPARK-13500.
> Keep an eye on it if it could be of help.
>
> Bryan
>
> On Wed, Feb 24, 2016 at 8:34 PM, Mishra, Abhishek <abhishek.mis...@xerox.com> wrote:
>
> Hello All,
>
> If someone has any leads on this, please help me.
>
> Sincerely,
> Abhishek
>
> From: Mishra, Abhishek
> Sent: Wednesday, February 24, 2016 5:11 PM
> To: user@spark.apache.org
> Subject: LDA topic Modeling spark + python
>
> Hello All,
>
> I am working on an LDA model; please guide me.
>
> I have a csv file which has two columns, "user_id" and "status". I have to
> generate a word-topic distribution after aggregating on user_id, meaning I
> need to model topics for users over their grouped statuses, with the topic
> length being 2000 and the value of k, or number of words, being 3.
>
> If you can point me to some link or code base for this on Spark with
> Python, I would be grateful.
>
> Looking forward to a reply,
>
> Sincerely,
> Abhishek