I am really sorry; it's actually my mistake. My Problem 2 is wrong, because using a single feature makes no sense. Sorry for the inconvenience. But I will still be waiting for solutions to Problems 1 and 3.
Thanks,

On Tue, Jul 8, 2014 at 12:14 PM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> wrote:
> Hello,
>
> I am a novice. I want to classify text into two classes, and for this
> purpose I want to use a Naive Bayes model. I am using Python for it.
>
> Here are the problems I am facing:
>
> *Problem 1:* I wanted to use all words as features for the bag-of-words
> model, which means my features will be counts of individual words. In this
> case, whenever a new word appears in the test data (one that was never
> present in the training data), I need to increase the size of the feature
> vector to incorporate that word as well. Correct me if I am wrong. Can I do
> that with the present MLlib NaiveBayes? Or what is the way to incorporate
> this?
>
> *Problem 2:* As I was not able to proceed with all words, I did some
> pre-processing and extracted a few features from the text. But using this
> also gives errors. Right now I am testing with only one feature from the
> text: the count of positive words.
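[Editorial note on Problem 1: one common way to keep the feature vector at a fixed size, so that unseen test-time words never force a resize, is the "hashing trick". The sketch below is plain Python, independent of any particular MLlib API; `NUM_BUCKETS` and `bag_of_words` are illustrative names, not part of the original code.]

```python
# Minimal sketch of the hashing trick: every word (seen or unseen) is
# mapped into one of a fixed number of buckets, so train and test vectors
# always have the same length.
import hashlib

NUM_BUCKETS = 1000  # fixed feature-vector size, chosen up front


def bag_of_words(text, num_buckets=NUM_BUCKETS):
    """Return a fixed-length word-count vector for `text`."""
    vec = [0.0] * num_buckets
    for word in text.lower().split():
        # Use a stable hash so train and test agree on each word's bucket
        # across runs (Python's built-in hash() is salted per process).
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        vec[h % num_buckets] += 1.0
    return vec


train_vec = bag_of_words("good movie good plot")
test_vec = bag_of_words("totally unseen words")  # still length 1000
```

The trade-off is that two different words may collide in one bucket, but the vector length never changes and no vocabulary has to be stored.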
> I am submitting the code below, along with the error:
>
> #############Code
>
> import tokenizer
> import gettingWordLists as gl
> from pyspark.mllib.classification import NaiveBayes
> from numpy import array
> from pyspark import SparkContext, SparkConf
>
> conf = (SparkConf().setMaster("local[6]").setAppName("My app").set("spark.executor.memory", "1g"))
>
> sc = SparkContext(conf=conf)
>
> # Getting the positive dict:
> pos_list = []
> pos_list = gl.getPositiveList()
> tok = tokenizer.Tokenizer(preserve_case=False)
>
> train_data = []
>
> with open("training_file.csv", "r") as train_file:
>     for line in train_file:
>         tokens = line.split(",")
>         msg = tokens[0]
>         sentiment = tokens[1]
>         count = 0
>         tokens = set(tok.tokenize(msg))
>         for i in tokens:
>             if i.encode('utf-8') in pos_list:
>                 count += 1
>         if sentiment.__contains__('NEG'):
>             label = 0.0
>         else:
>             label = 1.0
>         feature = []
>         feature.append(label)
>         feature.append(float(count))
>         train_data.append(feature)
>
> model = NaiveBayes.train(sc.parallelize(array(train_data)))
> print model.pi
> print model.theta
> print "\n\n\n\n\n", model.predict(array([5.0]))
>
> ##############
>
> *This is the output:*
>
> [-2.24512292 -0.11195389]
> [[ 0.]
>  [ 0.]]
> Traceback (most recent call last):
>   File "naive_bayes_analyser.py", line 77, in <module>
>     print "\n\n\n\n\n", model.predict(array([5.0]))
>   File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
>     return numpy.argmax(self.pi + dot(x, self.theta))
> ValueError: matrices are not aligned
>
> ##############
>
> *Problem 3:* As you can see, the output for model.pi is negative, i.e. the
> prior probabilities are negative. Can someone explain that as well? Is it
> the log of the probability?
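[Editorial note on the ValueError: the shape mismatch can be reproduced with plain NumPy using the shapes printed in the output above. This is only a sketch of why `dot(x, self.theta)` fails for these shapes, not a claim about what the library code should look like.]

```python
import numpy as np

pi = np.array([-2.24512292, -0.11195389])  # as printed by model.pi, shape (2,)
theta = np.array([[0.0], [0.0]])           # as printed by model.theta, shape (2, 1)
x = np.array([5.0])                        # the argument passed to predict(), shape (1,)

# predict() evaluates numpy.argmax(self.pi + dot(x, self.theta)).
# The dot product of a length-1 vector with a (2, 1) matrix is undefined,
# which is exactly the "matrices are not aligned" ValueError in the traceback.
try:
    np.dot(x, theta)
    aligned = True
except ValueError:
    aligned = False

# Transposing theta makes the shapes compatible: (1,) . (1, 2) -> (2,)
scores = pi + np.dot(x, theta.T)
```

So the per-class scores only line up when the feature vector is dotted against theta's rows, one row per class.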
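[Editorial note on Problem 3: a quick sanity check on the printed values suggests they are indeed log probabilities; exponentiating them gives values in (0, 1) that sum to roughly one.]

```python
import numpy as np

# model.pi was printed as [-2.24512292 -0.11195389]. If these are log
# class priors, exponentiating should recover probabilities summing to 1.
log_pi = np.array([-2.24512292, -0.11195389])
priors = np.exp(log_pi)
# priors is roughly [0.106, 0.894], and priors.sum() is roughly 1.0
```

Working in log space is standard for Naive Bayes, since multiplying many small probabilities underflows while adding their logs does not.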
>
> Thanks,
> --
> Rahul K Bhojwani
> 3rd Year B.Tech
> Computer Science and Engineering
> National Institute of Technology, Karnataka

--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka