You can refer to the following code snippet to set numFeatures for HashingTF:

    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(n)
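For a bit more context, here is a minimal sketch of HashingTF in use. The column names ("text", "words", "features"), the sample rows, and the choice of 2^18 buckets are just illustrative, and `spark` is assumed to be an existing SparkSession (on older releases, use a SQLContext's createDataFrame instead):

    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Tiny illustrative DataFrame with one text column.
    val df = spark.createDataFrame(Seq(
      (0, "spark hashing tf example"),
      (1, "hash collisions are a memory versus accuracy trade-off")
    )).toDF("id", "text")

    // Split raw text into words, then hash the words into a fixed-size feature vector.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")

    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1 << 18) // e.g. 2^18 buckets; tune this against your vocabulary size

    val featurized = hashingTF.transform(tokenizer.transform(df))
    featurized.select("features").show(false)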
2015-10-16 0:17 GMT+08:00 Nick Pentreath <nick.pentre...@gmail.com>:

> Setting numFeatures higher than the vocabulary size will tend to reduce
> the chance of hash collisions, but it's not strictly necessary - it
> becomes a memory / accuracy trade-off.
>
> Surprisingly, the impact of moderate hash collisions on model performance
> is often not significant.
>
> So it may be worth trying a few settings (lower than the vocabulary size,
> higher, etc.) and seeing what the impact is on your evaluation metrics.
>
> On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li <flyingfromch...@gmail.com>
> wrote:
>
>> Hi,
>>
>> There is a parameter in HashingTF called "numFeatures". I was wondering
>> what is the best way to set a value for this parameter. In the use case
>> of text categorization, do you need to know in advance the number of
>> words in your vocabulary, or do you set it to a large value, greater
>> than the number of words in your vocabulary?
>>
>> Thanks,
>>
>> Jianguo
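If you want to try a few settings and compare them on an evaluation metric, as Nick suggests, a cross-validated grid over numFeatures is one way to do it. The sketch below is only an outline: it assumes a labeled DataFrame named trainingData with "text" and "label" columns, a binary classification task, and arbitrary grid values as starting points:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Sweep numFeatures below and above the expected vocabulary size.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(1 << 14, 1 << 16, 1 << 18))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    // trainingData is assumed to exist with "text" and "label" columns.
    val cvModel = cv.fit(trainingData)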