I assume that all examples do actually fall into exactly one of the classes.
If you must always make a prediction, then you simply take the most probable class. If you can decline to classify for lack of confidence, then yes, you would pick a per-class threshold and take the most likely class among those that exceed their thresholds. To set each threshold, you would quantify the cost of making no classification versus the cost of making a wrong one for that class, and pick the value that equalizes the two expected costs.

On Nov 20, 2014 6:43 AM, "jatinpreet" <jatinpr...@gmail.com> wrote:

> I have been trying the Naive Bayes implementation of Spark's MLlib. During
> the testing phase, I wish to eliminate data with low confidence of
> prediction.
>
> My data set primarily consists of form-based documents like reports and
> application forms. They contain key-value pair type text, and hence I
> assume the independence condition holds better than with natural language.
>
> About the quality of priors, I am not doing anything special. I am
> training more or less equal numbers of samples for each class and have
> left the heavy lifting to be done by MLlib.
>
> Given these facts, does it make sense to have confidence thresholds
> defined for each category above which I will get correct results
> consistently?
>
> Thanks
> Jatin
>
> -----
> Novice Big Data Programmer
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Naive-Baye-s-classification-confidence-tp19341.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
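To make the reject-option idea above concrete, here is a minimal Python sketch, independent of Spark MLlib (the class names, costs, and posterior probabilities are made up for illustration). Under a simple cost model, predicting class c with posterior p incurs an expected misclassification cost of (1 - p) * cost_wrong[c], while abstaining costs cost_reject[c]; equalizing the two gives a per-class threshold of p >= 1 - cost_reject[c] / cost_wrong[c].

```python
def reject_threshold(cost_wrong, cost_reject):
    """Threshold at which the expected cost of predicting equals the cost
    of abstaining: predict iff (1 - p) * cost_wrong <= cost_reject,
    i.e. p >= 1 - cost_reject / cost_wrong."""
    return 1.0 - cost_reject / cost_wrong

def classify_with_reject(posteriors, thresholds):
    """posteriors: dict class -> P(class | x); thresholds: dict class -> min prob.
    Returns the most probable class among those clearing their threshold,
    or None to abstain."""
    # Keep only the classes confident enough to clear their own threshold.
    candidates = {c: p for c, p in posteriors.items() if p >= thresholds[c]}
    if not candidates:
        return None  # not confident in any class: make no classification
    # Among the confident classes, take the most probable one.
    return max(candidates, key=candidates.get)

# Illustrative per-class costs: (cost of a wrong prediction, cost of abstaining).
costs = {"report": (10.0, 2.0), "application": (5.0, 1.5), "other": (20.0, 2.0)}
thresholds = {c: reject_threshold(w, r) for c, (w, r) in costs.items()}
# -> report: 0.8, application: 0.7, other: 0.9

print(classify_with_reject({"report": 0.85, "application": 0.10, "other": 0.05},
                           thresholds))  # confident: prints "report"
print(classify_with_reject({"report": 0.50, "application": 0.45, "other": 0.05},
                           thresholds))  # no class clears its bar: prints None
```

Note that whether the thresholds behave well in practice still depends on how well the Naive Bayes posteriors are calibrated; they are often overconfident when the independence assumption is violated.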