Hi, Fernando, to get a better understanding of correlation, you could think of features as events in probability: if the probability of the intersection differs from the product of the individual probabilities, the events are correlated...
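A tiny sketch of that idea (the corpus and words here are my own toy example, not from the thread): treat each word's presence in a document as an event and compare the joint probability against the product of the marginals.

```python
# Toy corpus: each document is the set of words it contains.
# (Hypothetical data, just to illustrate intersection vs. independence.)
docs = [
    {"new", "york", "city"},
    {"new", "york", "times"},
    {"new", "car"},
    {"old", "york"},
    {"city", "lights"},
    {"new", "york"},
]

def p(word):
    """Marginal probability that a document contains `word`."""
    return sum(word in d for d in docs) / len(docs)

def p_joint(a, b):
    """Probability that a document contains both `a` and `b`."""
    return sum(a in d and b in d for d in docs) / len(docs)

# "new" and "york" co-occur more often than independence would predict,
# so as events they are (positively) correlated.
print(p_joint("new", "york"))   # observed joint probability
print(p("new") * p("york"))     # what independence would predict
```

With this corpus the joint probability (0.5) exceeds the independence prediction (about 0.44), which is exactly what "these word features are not independent" means.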
I agree with Ted. But usually, naive Bayes works well with text classification when you have a good pre-processing phase, using PCA, tf-idf or LDA... Are you doing any pre-processing?

On Dec 8, 2013 3:25 PM, "Ted Dunning" <[email protected]> wrote:
>
> The problem of correlation of features is clearly present in text, but it
> is not so clear what the effect will be. For naive Bayes this has the
> effect of making the classifier over-confident, but it usually still works
> reasonably well. For logistic regression without regularization it can
> cause the learning algorithm to fail (Mahout's logistic regression is
> regularized, btw).
>
> Empirical evidence dominates theory in this situation.
>
> Sent from my iPhone
>
> On Dec 8, 2013, at 9:14, Fernando Santos <[email protected]> wrote:
>
>> Now just a theoretical doubt. In a text classification example, what
>> would it mean to have features that are highly correlated? I mean, in
>> this case our features are basically words. Do you have an example of
>> how these features cannot be independent? This concept is not really
>> clear in my mind...
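A minimal sketch of Ted's over-confidence point (the likelihoods below are made-up numbers, not anything from Mahout): if a word feature is duplicated, i.e. perfectly correlated with itself, naive Bayes multiplies its likelihood in twice even though the second copy carries no new information, so the posterior gets inflated.

```python
# Two classes with equal priors; one word with these class-conditional
# likelihoods (assumed numbers for illustration).
p_word_spam, p_word_ham = 0.8, 0.2

def posterior_spam(n_copies):
    """Naive Bayes posterior P(spam | word), treating n perfectly
    correlated copies of the feature as if they were independent."""
    spam = p_word_spam ** n_copies
    ham = p_word_ham ** n_copies
    return spam / (spam + ham)

print(posterior_spam(1))  # the honest posterior with one feature
print(posterior_spam(2))  # the duplicate adds no information, yet the
                          # independence assumption pushes the posterior higher
```

The class ranking is unchanged, which is why naive Bayes "usually still works reasonably well"; only the confidence is wrong.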
