I replaced -999.0 with 0. The predictions still all have the same label. Maybe the remaining negative feature values really mess it up.
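
A minimal sketch of the pre-processing discussed in this thread, in plain Python rather than Spark so that it stands alone. It assumes -999.0 is a missing-value sentinel and that shifting each column by its minimum is an acceptable way to remove the remaining negatives (the sample rows contain genuine negatives such as -2.414, so replacing -999.0 alone leaves negative features in place). `SENTINEL` and `clean_rows` are hypothetical names, and neither step is taken from the linked source code.

```python
SENTINEL = -999.0  # assumption: marks a missing value in the Kaggle data


def clean_rows(rows):
    """Replace the sentinel with 0, then shift each column so its minimum is 0."""
    # Step 1: map the sentinel to 0, as suggested in the thread.
    replaced = [[0.0 if v == SENTINEL else v for v in row] for row in rows]
    # Step 2: per-column minimum, capped at 0 so nonnegative columns stay untouched.
    n_cols = len(replaced[0])
    mins = [min(0.0, min(row[j] for row in replaced)) for j in range(n_cols)]
    # Step 3: subtract the (nonpositive) minimum so every value is >= 0.
    shifted = [[row[j] - mins[j] for j in range(n_cols)] for row in replaced]
    return shifted, mins


# Three columns picked from the two sample rows in this thread.
rows = [[138.47, -2.414, 0.2], [160.937, 0.103, -999.0]]
cleaned, mins = clean_rows(rows)
```

The same `mins` would have to be reused on the test data so that both sets are shifted consistently. Whether this changes the predictions is not something the sketch can show.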
On Tue, Aug 19, 2014 at 4:51 PM, Xiangrui Meng <[email protected]> wrote:

> The ratio should be okay. Could you try to pre-process the data and
> map -999.0 to 0 before calling NaiveBayes? Btw, I added a check to
> ensure nonnegative feature values:
> https://github.com/apache/spark/pull/2038
>
> -Xiangrui
>
> On Tue, Aug 19, 2014 at 1:39 PM, Phuoc Do <[email protected]> wrote:
> > Hi Xiangrui,
> >
> > Training data: 42945 "s" out of 124659.
> > Test data: 42722 "s" out of 125341.
> >
> > The ratio is very much the same. I tried Decision Tree. It outputs
> > decimals between 0 and 1. I don't quite understand it yet.
> >
> > Would feature scaling make it work for Naive Bayes?
> >
> > Phuoc Do
> >
> > On Tue, Aug 19, 2014 at 12:51 AM, Xiangrui Meng <[email protected]> wrote:
> >>
> >> What is the ratio of examples labeled `s` to those labeled `b`? Also,
> >> Naive Bayes doesn't work on negative feature values. It assumes term
> >> frequencies as the input. We should throw an exception on negative
> >> feature values.
> >>
> >> -Xiangrui
> >>
> >> On Tue, Aug 19, 2014 at 12:07 AM, Phuoc Do <[email protected]> wrote:
> >> > I'm trying the Naive Bayes classifier for the Higgs Boson challenge on Kaggle:
> >> >
> >> > http://www.kaggle.com/c/higgs-boson
> >> >
> >> > Here's the source code I'm working on:
> >> >
> >> > https://github.com/dnprock/SparkHiggBoson/blob/master/src/main/scala/KaggleHiggBosonLabel.scala
> >> >
> >> > Training data looks like this:
> >> >
> >> > 100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,s
> >> >
> >> > 100001,160.937,68.768,103.235,48.146,-999,-999,-999,3.473,2.078,125.157,0.879,1.414,-999,42.014,2.039,-3.011,36.918,0.501,0.103,44.704,-1.916,164.546,1,46.226,0.725,1.158,-999,-999,-999,46.226,b
> >> >
> >> > My problem is that the Naive Bayes classifier always outputs 0 (for "b")
> >> > for all test data. I appreciate any help.
> >> >
> >> > --
> >> > Phuoc Do
> >> > https://vida.io/dnprock
> >
> > --
> > Phuoc Do
> > https://vida.io/dnprock
>

--
Phuoc Do
https://vida.io/dnprock
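
To illustrate the point quoted above that Naive Bayes assumes term frequencies as input, here is a small sketch of the textbook smoothed multinomial estimate, log((sum_j + lam) / (total + n * lam)). It is not a copy of Spark's implementation; `smoothed_log_likelihoods` and `lam` are names made up for this example, and the input vectors are toy per-class feature sums, not values computed from the Kaggle data.

```python
import math


def smoothed_log_likelihoods(feature_sums, lam=1.0):
    """Per-feature log-likelihoods for one class, with additive smoothing."""
    n = len(feature_sums)
    denom = sum(feature_sums) + n * lam
    out = []
    for s in feature_sums:
        num = s + lam
        # log(num) - log(denom) is only defined when both terms are positive;
        # report NaN otherwise instead of raising.
        if num > 0 and denom > 0:
            out.append(math.log(num) - math.log(denom))
        else:
            out.append(float("nan"))
    return out


# Nonnegative sums behave like term counts and give finite values.
theta_ok = smoothed_log_likelihoods([3.0, 1.0, 0.0])
# A single large negative sum drives the denominator below zero.
theta_bad = smoothed_log_likelihoods([3.0, -999.0, 0.0])
```

With nonnegative sums every estimate is a finite log-probability. Once a feature sum is negative enough, the denominator is no longer positive and every estimate for that class becomes undefined, which is consistent with the advice to reject negative feature values up front.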
