You can refer to the following code snippet to set numFeatures for HashingTF:

import org.apache.spark.ml.feature.HashingTF

// n is the number of hash buckets, i.e. the dimensionality of the output feature vectors
val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(n)

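If you want to base n on the actual size of your vocabulary, here is a rough sketch (assuming your tokenized DataFrame is called wordsDF and the tokenized column is "words"; those names are just placeholders):

import org.apache.spark.sql.functions.{col, explode}

// Estimate the vocabulary size by counting distinct tokens in the "words" column
val vocabSize = wordsDF
  .select(explode(col("words")).as("word"))
  .distinct()
  .count()

// e.g. pick numFeatures somewhat larger than the vocabulary to reduce collisions
val n = (vocabSize * 2).toInt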

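And for Nick's suggestion below of trying a few settings and comparing evaluation metrics, a minimal sketch could look like the following (assuming labeled DataFrames named training and test with "words" and "label" columns, and a binary classification task; the candidate sizes are arbitrary):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.HashingTF

val evaluator = new BinaryClassificationEvaluator()  // areaUnderROC by default

for (n <- Seq(1 << 15, 1 << 18, 1 << 20)) {
  val hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("features")
    .setNumFeatures(n)
  val lr = new LogisticRegression()
  val model = new Pipeline().setStages(Array(hashingTF, lr)).fit(training)
  println(s"numFeatures = $n, AUC = ${evaluator.evaluate(model.transform(test))}")
}
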
2015-10-16 0:17 GMT+08:00 Nick Pentreath <nick.pentre...@gmail.com>:

> Setting numFeatures higher than the vocabulary size will tend to reduce the
> chance of hash collisions, but it's not strictly necessary - it becomes a
> memory / accuracy trade-off.
>
> Surprisingly, the impact on model performance of moderate hash collisions
> is often not significant.
>
> So it may be worth trying a few settings (lower than the vocabulary size,
> higher, etc.) and seeing what the impact is on your evaluation metrics.
>
>
>
> On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li <flyingfromch...@gmail.com>
> wrote:
>
>> Hi,
>>
>> There is a parameter in the HashingTF called "numFeatures". I was
>> wondering what is the best way to set the value of this parameter. In the
>> use case of text categorization, do you need to know in advance the number
>> of words in your vocabulary? Or do you set it to a large value, greater
>> than the number of words in your vocabulary?
>>
>> Thanks,
>>
>> Jianguo
>>
>
>
