The input to LDA.train needs to be an RDD of two-element lists, where the first element is an integer id and the second is a pyspark.mllib Vector of real numbers (term counts); i.e., an RDD of [doc_id, vector_of_counts].

From your example, it looks like your corpus is instead a zero-based id paired with a tuple of the user id and the list of lines from the data that share that user_id, something like [doc_id, (user_id, [line0, line1])]. You need to turn that second element into a Vector of term counts.
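For example, something along these lines might work as a starting point (a rough, untested sketch: it assumes simple whitespace tokenization, assumes the whole vocabulary fits in driver memory, and assumes "grouped" is your RDD of (user_id, lines) pairs; to_count_vector is just a helper name I made up):

    from pyspark.mllib.linalg import Vectors

    # flatten each user's lines into one bag of words per document
    docs = grouped.map(lambda x: [w for line in x[1] for w in line.split()])

    # assign every distinct term an index into the count vector
    vocab = docs.flatMap(lambda words: words).distinct().zipWithIndex().collectAsMap()

    def to_count_vector(words):
        # build {term_index: count} and wrap it in a sparse mllib Vector
        counts = {}
        for w in words:
            counts[vocab[w]] = counts.get(vocab[w], 0) + 1.0
        return Vectors.sparse(len(vocab), counts)

    corpus = docs.zipWithIndex().map(lambda x: [x[1], to_count_vector(x[0])]).cache()
    model = LDA.train(corpus, k=10, maxIterations=10, optimizer="online")

In practice you would probably also want to lowercase, strip punctuation, and drop stop words before counting, but the key point is that the second element of each list must be a Vector of counts before it reaches LDA.train.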
On Sun, Feb 28, 2016 at 11:08 PM, Mishra, Abhishek <abhishek.mis...@xerox.com> wrote:

> Hello Bryan,
>
> Thank you for the update on the Jira. I took your code and tried it with
> mine, but I get an error when the vector is created. Please see my code
> below and advise me.
>
> My input file has contents like this:
>
> "user_id","status"
> "0026c10bbbc7eeb55a61ab696ca93923","http:////www.youtube.com//watch?v=n3nPiBai66M&feature=related **bobsnewline** tiftakar, Trudy Darmanin <3?"
> "0026c10bbbc7eeb55a61ab696ca93923","Brandon Cachia ,All I know is that,you're so nice."
> "0026c10bbbc7eeb55a61ab696ca93923","Melissa Zejtunija:HAM AND CHEESE BIEX INI??? **bobsnewline** Kirr:bit tigieg mel **bobsnewline** Melissa Zejtunija :jaq le mandix aptit tigieg **bobsnewline** Kirr:int bis serjeta?"
> "0026c10bbbc7eeb55a61ab696ca93923",".........Where is my mind?????"
>
> And this is what I am doing in my code:
>
> import string
> from pyspark.sql import SQLContext
> from pyspark import SparkConf, SparkContext
> from pyspark.mllib.clustering import LDA, LDAModel
> from nltk.tokenize import word_tokenize
> from stop_words import get_stop_words
> from nltk.stem.porter import PorterStemmer
> from gensim import corpora, models
> import gensim
> import textmining
> import pandas as pd
>
> conf = SparkConf().setAppName("building a warehouse")
> sc = SparkContext(conf=conf)
> sql_sc = SQLContext(sc)
>
> data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
> header = data.first()  # extract header
> print header
> data = data.filter(lambda x: x != header)  # filter out header
> pairs = data.map(lambda x: (x.split(',')[0], x))  # generate pair rdd of (key, value)
> #data11 = data.subtractByKey(header)
> #print pairs.collect()
> #grouped = pairs.map(lambda (x, y): (x, [y])).reduceByKey(lambda a, b: a + b)
> grouped = pairs.groupByKey()  # group values per key
> #print grouped.collectAsMap()
> grouped_val = grouped.map(lambda x: (list(x[1]))).collect()
> #rr = grouped_val.map(lambda (x, y): (x, [y]))
> #df_grouped_val = sql_sc.createDataFrame(rr, ["user_id", "status"])
> #print list(enumerate(grouped_val))
> #corpus = grouped.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
> corpus = grouped.zipWithIndex().map(lambda (term_counts, doc_id): [doc_id, term_counts]).cache()
> model = LDA.train(corpus, k=10, maxIterations=10, optimizer="online")
> #ldaModel = LDA.train(corpus, k=3)
> print corpus
> topics = model.describeTopics(3)
> print("\"topic\", \"termIndices\", \"termWeights\"")
> for i, t in enumerate(topics):
>     print("%d, %s, %s" % (i, str(t[0]), str(t[1])))
>
> sc.stop()
>
> Please help me with this.
>
> Abhishek
>
> From: Bryan Cutler [mailto:cutl...@gmail.com]
> Sent: Friday, February 26, 2016 4:17 AM
> To: Mishra, Abhishek
> Cc: user@spark.apache.org
> Subject: Re: LDA topic Modeling spark + python
>
> I'm not exactly sure how you would like to set up your LDA model, but I
> noticed there was no Python example for LDA in Spark.
> I created this issue to add one: https://issues.apache.org/jira/browse/SPARK-13500.
> Keep an eye on it if it could be of help.
>
> Bryan
>
> On Wed, Feb 24, 2016 at 8:34 PM, Mishra, Abhishek <abhishek.mis...@xerox.com> wrote:
>
> Hello All,
>
> If someone has any leads on this, please help me.
>
> Sincerely,
> Abhishek
>
> From: Mishra, Abhishek
> Sent: Wednesday, February 24, 2016 5:11 PM
> To: user@spark.apache.org
> Subject: LDA topic Modeling spark + python
>
> Hello All,
>
> I am working on an LDA model; please guide me.
>
> I have a csv file which has two columns, "user_id" and "status". I have to
> generate a word-topic distribution after aggregating on user_id, meaning I
> need to model topics for users over their grouped statuses, with the topic
> length being 2000 and the value of k, or number of words, being 3.
>
> If you can point me to some link or code base for this on Spark with
> Python, I would be grateful.
>
> Looking forward to a reply,
>
> Sincerely,
> Abhishek