You can use CrossValidator/TrainValidationSplit with ParamGridBuilder and an Evaluator to empirically choose model hyperparameters (e.g., numFeatures), per the following:
http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split

On Fri, Jan 1, 2016 at 7:48 AM, Yanbo Liang <yblia...@gmail.com> wrote:

> You can refer to the following code snippet to set numFeatures for HashingTF:
>
>     val hashingTF = new HashingTF()
>       .setInputCol("words")
>       .setOutputCol("features")
>       .setNumFeatures(n)
>
> 2015-10-16 0:17 GMT+08:00 Nick Pentreath <nick.pentre...@gmail.com>:
>
>> Setting numFeatures higher than the vocabulary size will tend to reduce
>> the chance of hash collisions, but it's not strictly necessary - it
>> becomes a memory / accuracy trade-off.
>>
>> Surprisingly, the impact of moderate hash collisions on model
>> performance is often not significant.
>>
>> So it may be worth trying a few settings (lower than the vocabulary
>> size, higher, etc.) and seeing what the impact is on evaluation metrics.
>>
>> On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li <flyingfromch...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> There is a parameter in HashingTF called "numFeatures". I was
>>> wondering what is the best way to set the value of this parameter. In
>>> the use case of text categorization, do you need to know in advance
>>> the number of words in your vocabulary? Or do you set it to a large
>>> value, greater than the number of words in your vocabulary?
>>>
>>> Thanks,
>>>
>>> Jianguo

--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
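A minimal sketch of the grid search suggested above, using Spark's spark.ml Pipeline API. The column names ("text", "label"), the candidate numFeatures/regParam values, and the choice of LogisticRegression are illustrative assumptions, not from the thread; only the grid construction runs here, and fitting is left as a comment since it needs a labeled DataFrame:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Hypothetical pipeline: raw text -> tokens -> hashed term frequencies -> classifier.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Search over numFeatures (and regParam) instead of guessing one value up front.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1 << 14, 1 << 16, 1 << 18))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

println(s"${paramGrid.length} candidate models")  // 3 numFeatures x 2 regParam = 6
// With a labeled DataFrame `training` in scope:
// val cvModel = cv.fit(training)
```

CrossValidator then picks the numFeatures setting that scores best on the held-out folds, which is exactly the empirical approach recommended in the reply above.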
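Nick's memory/accuracy trade-off can also be seen directly with plain Scala, by approximating how a hashing vectorizer buckets terms (non-negative hash modulo numFeatures) and counting collisions for a toy vocabulary. The vocabulary and the use of MurmurHash3 here are illustrative assumptions, not Spark's exact internal hashing:

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical toy vocabulary of 1000 distinct terms.
val vocab = (1 to 1000).map(i => s"word$i")

// Bucket a term the way a hashing vectorizer does: non-negative hash mod numFeatures.
def bucket(term: String, numFeatures: Int): Int = {
  val h = MurmurHash3.stringHash(term) % numFeatures
  if (h < 0) h + numFeatures else h
}

// Count the terms that share a bucket with at least one other term.
def collidingTerms(numFeatures: Int): Int =
  vocab.groupBy(bucket(_, numFeatures)).values.filter(_.size > 1).map(_.size).sum

// numFeatures below the vocab size guarantees collisions (pigeonhole);
// numFeatures well above it makes them rare but costs a wider feature vector.
val small = collidingTerms(256)      // fewer buckets than terms
val large = collidingTerms(1 << 18)  // far more buckets than terms
println(s"256 buckets: $small colliding terms; 2^18 buckets: $large colliding terms")
```

This makes the trade-off concrete: with 256 buckets, at least 744 of the 1000 terms must collide, while with 2^18 buckets collisions nearly vanish, at the cost of a 262144-dimensional feature vector.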