I am really sorry; it's actually my mistake. My Problem 2 is wrong, because using a single feature makes no sense. Sorry for the inconvenience. But I will still be waiting for solutions to Problems 1 and 3.
Thanks,

On Tue, Jul 8, 2014 at 12:14 PM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> wrote:
> Hello,
>
> I am a novice. I want to classify text into two classes, and for this
> purpose I want to use a Naive Bayes model. I am using Python for it.
>
> Here are the problems I am facing:
>
> *Problem 1:* I wanted to use all words as features for the bag-of-words
> model, which means my features will be counts of individual words. In this
> case, whenever a new word appears in the test data (one that was never
> present in the training data), I need to increase the size of the feature
> vector to incorporate that word as well. Correct me if I am wrong. Can I do
> that with the present MLlib NaiveBayes? Or what is the way to incorporate
> this?
>
> *Problem 2:* As I was not able to proceed with all words, I did some
> pre-processing and extracted a few features from the text. But using this
> also gives errors. Right now I am testing with only one feature from the
> text: the count of positive words.
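[Editorial note on Problem 1: one common way to keep the feature vector at a fixed size, so that unseen test-time words never force a resize, is the "hashing trick". The sketch below is plain Python, independent of any particular MLlib API; `NUM_BUCKETS` and `bag_of_words` are illustrative names, not part of the original code.]

```python
# Minimal sketch of the hashing trick: every word (seen or unseen) is
# mapped into one of a fixed number of buckets, so train and test vectors
# always have the same length.
import hashlib

NUM_BUCKETS = 1000  # fixed feature-vector size, chosen up front


def bag_of_words(text, num_buckets=NUM_BUCKETS):
    """Return a fixed-length word-count vector for `text`."""
    vec = [0.0] * num_buckets
    for word in text.lower().split():
        # Use a stable hash so train and test agree on each word's bucket
        # across runs (Python's built-in hash() is salted per process).
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        vec[h % num_buckets] += 1.0
    return vec


train_vec = bag_of_words("good movie good plot")
test_vec = bag_of_words("totally unseen words")  # still length 1000
```

The trade-off is that two different words may collide in one bucket, but the vector length never changes and no vocabulary has to be stored.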
> I am submitting the code below, along with the error:
>
> #############Code
>
> import tokenizer
> import gettingWordLists as gl
> from pyspark.mllib.classification import NaiveBayes
> from numpy import array
> from pyspark import SparkContext, SparkConf
>
> conf = (SparkConf().setMaster("local[6]").setAppName("My app").set("spark.executor.memory", "1g"))
>
> sc = SparkContext(conf=conf)
>
> # Getting the positive dict:
> pos_list = []
> pos_list = gl.getPositiveList()
> tok = tokenizer.Tokenizer(preserve_case=False)
>
> train_data = []
>
> with open("training_file.csv", "r") as train_file:
>     for line in train_file:
>         tokens = line.split(",")
>         msg = tokens[0]
>         sentiment = tokens[1]
>         count = 0
>         tokens = set(tok.tokenize(msg))
>         for i in tokens:
>             if i.encode('utf-8') in pos_list:
>                 count += 1
>         if sentiment.__contains__('NEG'):
>             label = 0.0
>         else:
>             label = 1.0
>         feature = []
>         feature.append(label)
>         feature.append(float(count))
>         train_data.append(feature)
>
> model = NaiveBayes.train(sc.parallelize(array(train_data)))
> print model.pi
> print model.theta
> print "\n\n\n\n\n", model.predict(array([5.0]))
>
> ##############
>
> *This is the output:*
>
> [-2.24512292 -0.11195389]
> [[ 0.]
>  [ 0.]]
> Traceback (most recent call last):
>   File "naive_bayes_analyser.py", line 77, in <module>
>     print "\n\n\n\n\n", model.predict(array([5.0]))
>   File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
>     return numpy.argmax(self.pi + dot(x, self.theta))
> ValueError: matrices are not aligned
>
> ##############
>
> *Problem 3:* As you can see, the output for model.pi is negative, i.e. the
> prior probabilities are negative. Can someone explain that as well? Is it
> the log of the probability?
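[Editorial note on the ValueError: the shape mismatch can be reproduced with plain NumPy using the shapes printed in the output above. This is only a sketch of why `dot(x, self.theta)` fails for these shapes, not a claim about what the library code should look like.]

```python
import numpy as np

pi = np.array([-2.24512292, -0.11195389])  # as printed by model.pi, shape (2,)
theta = np.array([[0.0], [0.0]])           # as printed by model.theta, shape (2, 1)
x = np.array([5.0])                        # the argument passed to predict(), shape (1,)

# predict() evaluates numpy.argmax(self.pi + dot(x, self.theta)).
# The dot product of a length-1 vector with a (2, 1) matrix is undefined,
# which is exactly the "matrices are not aligned" ValueError in the traceback.
try:
    np.dot(x, theta)
    aligned = True
except ValueError:
    aligned = False

# Transposing theta makes the shapes compatible: (1,) . (1, 2) -> (2,)
scores = pi + np.dot(x, theta.T)
```

So the per-class scores only line up when the feature vector is dotted against theta's rows, one row per class.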
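[Editorial note on Problem 3: a quick sanity check on the printed values suggests they are indeed log probabilities; exponentiating them gives values in (0, 1) that sum to roughly one.]

```python
import numpy as np

# model.pi was printed as [-2.24512292 -0.11195389]. If these are log
# class priors, exponentiating should recover probabilities summing to 1.
log_pi = np.array([-2.24512292, -0.11195389])
priors = np.exp(log_pi)
# priors is roughly [0.106, 0.894], and priors.sum() is roughly 1.0
```

Working in log space is standard for Naive Bayes, since multiplying many small probabilities underflows while adding their logs does not.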
>
> Thanks,
> --
> Rahul K Bhojwani
> 3rd Year B.Tech
> Computer Science and Engineering
> National Institute of Technology, Karnataka

--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka