Re: How to specify the numFeatures in HashingTF

2016-01-02 Thread Chris Fregly
You can use CrossValidator or TrainValidationSplit with ParamGridBuilder and an Evaluator to choose the model hyperparameters (e.g. numFeatures) empirically, per the following: http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation http://spark.apache.org/docs/
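
A minimal sketch of the suggestion above, assuming a binary text-classification pipeline in the spark.ml API. The column names ("text", "words", "features"), the candidate numFeatures values, the LogisticRegression stage, and the `training` DataFrame mentioned in the comment are illustrative assumptions, not taken from the thread.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object NumFeaturesSearch {
  // Assumed pipeline: tokenize text, hash terms into a feature vector, fit a classifier.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

  // Candidate values for numFeatures (assumed; pick a range suited to your vocabulary).
  val paramGrid = new ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, Array(1 << 10, 1 << 14, 1 << 18))
    .build()

  // Cross-validation scores each candidate with the evaluator and keeps the best one.
  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  // With a hypothetical DataFrame `training` holding "text" and "label" columns:
  //   val cvModel = cv.fit(training)
}
```

Nothing is trained until `cv.fit` is called; the objects above only describe the search. TrainValidationSplit accepts the same estimator, evaluator, and parameter grid, and evaluates each candidate once instead of once per fold.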

Re: How to specify the numFeatures in HashingTF

2016-01-01 Thread Yanbo Liang
You can refer to the following code snippet to set numFeatures for HashingTF:

  import org.apache.spark.ml.feature.HashingTF

  // n is the number of hash buckets (the dimension of the output vector)
  val hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("features")
    .setNumFeatures(n)

2015-10-16 0:17 GMT+08:00 Nick Pentreath:
> Setting the numfeatures higher than vocab size will tend t

Re: How to specify the numFeatures in HashingTF

2015-10-15 Thread Nick Pentreath
Setting numFeatures higher than the vocabulary size will tend to reduce the chance of hash collisions, but it's not strictly necessary; it becomes a memory/accuracy trade-off. Surprisingly, the impact of moderate hash collisions on model performance is often not significant. So it may b
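
To make the collision trade-off above concrete, here is a small self-contained sketch in plain Scala (no Spark required) that counts how many terms of a synthetic vocabulary end up sharing a bucket for several bucket counts. The vocabulary, the bucket counts, and the use of `MurmurHash3.stringHash` are assumptions for illustration; this is not the exact hash function HashingTF uses, so the counts only indicate the trend.

```scala
import scala.util.hashing.MurmurHash3

object HashCollisionSketch {
  // Map a term to a bucket in [0, numFeatures), as the hashing trick does.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = MurmurHash3.stringHash(term) % numFeatures
    if (raw < 0) raw + numFeatures else raw // keep the index non-negative
  }

  // Number of terms that land in a bucket already used by another term.
  def collisions(vocab: Seq[String], numFeatures: Int): Int =
    vocab.size - vocab.map(bucket(_, numFeatures)).distinct.size

  def main(args: Array[String]): Unit = {
    // Hypothetical vocabulary of 10,000 distinct terms.
    val vocab = (0 until 10000).map(i => s"word$i")
    for (n <- Seq(1 << 10, 1 << 14, 1 << 18, 1 << 20))
      println(s"numFeatures=$n collisions=${collisions(vocab, n)}")
  }
}
```

When numFeatures is smaller than the vocabulary, collisions are unavoidable; above the vocabulary size they become rarer as numFeatures grows, at the cost of a wider feature vector.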

How to specify the numFeatures in HashingTF

2015-10-15 Thread Jianguo Li
Hi, There is a parameter in HashingTF called "numFeatures". I was wondering what the best way is to set this parameter. In the use case of text categorization, do you need to know in advance the number of words in your vocabulary? Or do you set it to be a large value, greater than