You can use CrossValidator or TrainValidationSplit with ParamGridBuilder
and an Evaluator to empirically choose model hyperparameters (e.g.
numFeatures), as described here:
http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
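You can refer to the following code snippet to set numFeatures for HashingTF:

import org.apache.spark.ml.feature.HashingTF

// n: the number of hash buckets (the dimension of the feature vector)
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(n)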
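
A minimal sketch of a grid search over numFeatures, modelled on the cross-validation example in the ML guide. The Tokenizer and LogisticRegression stages, the candidate values, the object and helper names, and the `training` DataFrame (assumed to have "text" and "label" columns) are assumptions for illustration only:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

object NumFeaturesTuning {
  // Pipeline: raw text -> words -> hashed term frequencies -> classifier.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

  // Candidate values for numFeatures (illustrative, not recommendations).
  val paramGrid = new ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, Array(1 << 10, 1 << 14, 1 << 18))
    .build()

  // 3-fold cross-validation; the evaluator assumes a binary label.
  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator)
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  // Hypothetical helper: fits one model per candidate and fold, keeps the best.
  def tuneNumFeatures(training: DataFrame): CrossValidatorModel = cv.fit(training)
}
```

The best pipeline is then available as bestModel on the returned CrossValidatorModel. For a multi-class problem a different Evaluator would be needed.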
2015-10-16 0:17 GMT+08:00 Nick Pentreath :
Setting numFeatures higher than the vocabulary size will tend to reduce the
chance of hash collisions, but it's not strictly necessary; it becomes a
memory / accuracy trade-off.
Surprisingly, the impact on model performance of moderate hash collisions is
often not significant.
So it may b
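
To get a feel for the memory / accuracy trade-off described above, here is a small sketch that counts how many terms of a vocabulary end up sharing a bucket for several values of numFeatures. It uses plain String.hashCode as a stand-in hash, which is an assumption and not necessarily what HashingTF does internally, and the vocabulary is synthetic:

```scala
object HashCollisionSketch {
  // Non-negative modulo, so negative hash codes still map into [0, numFeatures).
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Number of distinct terms that land in a bucket already taken by another term.
  def collisions(vocab: Seq[String], numFeatures: Int): Int = {
    val terms = vocab.distinct
    terms.size - terms.map(bucket(_, numFeatures)).distinct.size
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical vocabulary of 10,000 distinct terms.
    val vocab = (0 until 10000).map(i => s"word$i")
    for (n <- Seq(1 << 10, 1 << 14, 1 << 18, 1 << 20))
      println(s"numFeatures=$n collisions=${collisions(vocab, n)}")
  }
}
```

With fewer buckets than terms, collisions are unavoidable. Beyond the vocabulary size they become rarer but do not necessarily vanish, while the feature dimension keeps growing.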
Hi,
There is a parameter in the HashingTF called "numFeatures". I was wondering
what is the best way to set the value to this parameter. In the use case of
text categorization, do you need to know in advance the number of words in
your vocabulary? Or do you set it to be a large value, greater than